


Indexability of Restless Bandit Problems and Optimality 
of Whittle's Index for Dynamic Multichannel Access 

Keqin Liu, Qing Zhao 
University of California, Davis, CA 95616 
kqliu@ucdavis.edu, qzhao@ece.ucdavis.edu



Abstract 

We consider a class of restless multi-armed bandit problems (RMBP) that arises in dynamic 
multichannel access, user/server scheduling, and optimal activation in multi-agent systems. For this class 
of RMBP, we establish the indexability and obtain Whittle's index in closed-form for both discounted 
and average reward criteria. These results lead to a direct implementation of Whittle's index policy 
with remarkably low complexity. When these Markov chains are stochastically identical, we show that 
Whittle's index policy is optimal under certain conditions. Furthermore, it has a semi-universal structure 
that obviates the need to know the Markov transition probabilities. The optimality and the semi-universal 
structure result from the equivalency between Whittle's index policy and the myopic policy established 
in this work. For non-identical channels, we develop efficient algorithms for computing a performance 
upper bound given by Lagrangian relaxation. The tightness of the upper bound and the near-optimal 
performance of Whittle's index policy are illustrated with simulation examples. 

Index Terms

Opportunistic access, dynamic channel selection, restless multi-armed bandit, Whittle's index,
indexability, myopic policy.

I. Introduction

A. Restless Multi-armed Bandit Problem 

The Restless Multi-armed Bandit Process (RMBP) is a generalization of the classical Multi-armed
Bandit Process (MBP), which has been studied since the 1930s [1]. In an MBP, a player, with
full knowledge of the current state of each arm, chooses one out of N arms to activate at each 
time and receives a reward determined by the state of the activated arm. Only the activated arm 
changes its state according to a Markovian rule while the states of passive arms are frozen. The 
objective is to maximize the long-run reward over the infinite horizon by choosing which arm 
to activate at each time. 

The structure of the optimal policy for the classical MBP was established by Gittins in 1979 [2], 
who proved that an index policy is optimal. The significance of Gittins' result is that it reduces
the complexity of finding the optimal policy for an MBP from exponential in N to linear in
N. Specifically, an index policy assigns an index to each state of each arm and activates the arm
whose current state has the largest index. Arms are decoupled when computing the index, thus
reducing an N-dimensional problem to N independent 1-dimensional problems.

Whittle generalized MBP to RMBP by allowing multiple (K > 1) arms to be activated simultaneously
and allowing passive arms to also change states [3]. Either of these two generalizations
would render Gittins' index policy suboptimal in general, and finding the optimal solution to 
a general RMBP has been shown to be PSPACE-hard by Papadimitriou and Tsitsiklis [4]. In 
fact, merely allowing multiple plays (K > 1) would have fundamentally changed the problem
as shown in the classic work by Anantharam et al. [5] and by Pandelis and Teneketzis [6]. 

By considering the Lagrangian relaxation of the problem, Whittle proposed a heuristic index
policy for RMBP [3]. Whittle's index policy is the optimal solution to RMBP under a relaxed
constraint: the number of activated arms can vary over time provided that its average over
the infinite horizon equals K. This average constraint leads to decoupling among arms and,
subsequently, to the optimality of an index policy. Under the strict constraint that exactly K arms
are to be activated at each time, Whittle's index policy has been shown to be asymptotically
optimal under certain conditions ($N \to \infty$ stochastically identical arms) [7]. In the finite regime,
extensive empirical studies have demonstrated its near-optimal performance, see, for example, 
[8], [9]. 

The difficulty of Whittle's index policy lies in the complexity of establishing its existence
and computing the index, especially for RMBP with uncountable state space as in our case. 
Not every RMBP has a well-defined Whittle's index; those that admit Whittle's index policy are 
called indexable [3]. The indexability of an RMBP is often difficult to establish, and computing 
Whittle's index can be complex, often relying on numerical approximations. 

In this paper, we show that for a significant class of RMBP most relevant to multichannel 
dynamic access applications, the indexability can be established and Whittle's index can be 
obtained in closed form. For stochastically identical arms, we establish the equivalency between 
Whittle's index policy and the myopic policy. This result, coupled with recent findings in [10], 
[11] on the myopic policy for this class of RMBP, shows that Whittle's index policy achieves the 
optimal performance under certain conditions and has a semi-universal structure that is robust 
against model mismatch and variations. This class of RMBP is described next. 

B. Dynamic Multichannel Access 

Consider the problem of probing N independent Markov chains. Each chain has two states,
"good" and "bad", with different transition probabilities across chains (see Fig. 1). At each
time, a player can choose K ($1 \le K < N$) chains to probe and receives reward determined by
the states of the probed chains. The objective is to design an optimal policy that governs the 
selection of K chains at each time to maximize the long-run reward. 




Fig. 1. The Gilbert-Elliot channel model.

The above general problem arises in a wide range of communication systems, including cognitive
radio networks, downlink scheduling in cellular systems, opportunistic transmission over
fading channels, and resource-constrained jamming and anti-jamming. In the communications
context, the N independent Markov chains correspond to communication channels under the
Gilbert-Elliot channel model [12], which has been commonly used to abstract physical channels
with memory (see, for example, [13], [14]). The state of a channel models the communication 
quality of this channel and determines the resultant reward of accessing this channel. For example, 
in cognitive radio networks where secondary users search in the spectrum for idle channels 
temporarily unused by primary users [15], the state of a channel models the occupancy of the 
channel. For downlink scheduling in cellular systems, the user is a base station, and each channel 
is associated with a downlink mobile receiver. Downlink receiver scheduling is thus equivalent 
to channel selection. 

The application of this problem also goes beyond communication systems. For example, it 
has applications in target tracking as considered in [16], where K unmanned aerial vehicles are 
tracking the states of N (N > K) targets in each slot.

C. Main Results 

Fundamental questions concerning Whittle's index policy since the day of its invention have 
been its existence, its performance, and the complexity in computing the index. What are the 
necessary and/or sufficient conditions on the state transition and the reward structure that make 
an RMBP indexable? When can Whittle's index be obtained in closed-form? For which special 
classes of RMBP is Whittle's index policy optimal? When numerical evaluation has to be resorted 
to in studying its performance, are there easily computable performance benchmarks? 

In this paper, we attempt to address these questions for the class of RMBP described above. 
As will be shown, this class of RMBP has an uncountable state space, making the problem 
highly nontrivial. The underlying two-state Markov chain that governs the state transition of 
each arm, however, brings rich structures into the problem, leading to positive and surprising 
answers to the above questions. The wide range of applications of this class of RMBP makes 
the results obtained in this paper generally applicable. 

Under both discounted and average reward criteria, we establish the indexability of this class 
of RMBP. The basic technique of our proof is to bound the total amount of time that an arm is 
made passive under the optimal policy. The general approach of using the total passive time in 
proving indexability was considered by Whittle in [3] when showing that a classic MBP is always 
indexable. Applying this approach to a nontrivial RMBP is, however, much more involved, and 
our proof appears to be the first that extends this approach to RMBP. We hope that this work 
contributes to the set of possible techniques for establishing indexability of RMBP.

Based on the indexability, we show that Whittle's index can be obtained in closed-form for 
both discounted and average reward criteria. This result reduces the complexity of implementing 
Whittle's index policy to simple evaluations of these closed-form expressions. This result is 
particularly significant considering the uncountable state space which would render numerical 
approaches impractical. The monotonically increasing and piecewise concave (for arms with
$p_{11} > p_{01}$) or piecewise convex (for arms with $p_{11} < p_{01}$) properties of Whittle's index are also
established. The monotonicity of Whittle's index leads to an interesting equivalency with the 
myopic policy — the simplest nontrivial index policy — when arms are stochastically identical. 
This equivalency allows us to work on the myopic index, which has a much simpler form, when 
establishing the structure and optimality of Whittle's index policy for stochastically identical 
arms. 

As to the performance of Whittle's index policy for this class of RMBP, we show that,
under certain conditions, Whittle's index policy is optimal for stochastically identical arms.
This result provides examples for the optimality of Whittle's index policy in the finite regime. 
The approximation factor of Whittle's index policy (the ratio of the performance of Whittle's 
index policy to that of the optimal policy) is analyzed when the optimality conditions do not 
hold. Specifically, we show that when arms are stochastically identical, the approximation factor 
of Whittle's index policy is at least $K/N$ when $p_{11} > p_{01}$ and at least $\max\{1/2, K/N\}$ when $p_{11} < p_{01}$.

When arms are non-identical, we develop an efficient algorithm to compute a performance 
upper bound based on Lagrangian relaxation. We show that this algorithm runs in at most
$O(N(\log N)^2)$ time to compute the performance upper bound within $\epsilon$-accuracy for any $\epsilon > 0$.
Furthermore, when every channel satisfies $p_{11} < p_{01}$, we can compute the upper bound without
error with complexity $O(N^2 \log N)$.

Another interesting finding is that when arms are stochastically identical, Whittle's index policy
has a semi-universal structure that obviates the need to know the Markov transition probabilities.
The only required knowledge about the Markovian model is the order of $p_{11}$ and $p_{01}$. This
semi-universal structure reveals the robustness of Whittle's index policy against model mismatch and
variations. 

D. Related Work

Multichannel opportunistic access in the context of cognitive radio systems has been studied 
in [17], [18] where the problem is formulated as a Partially Observable Markov Decision Process 
(POMDP) to take into account potential correlations among channels. For stochastically identical 
and independent channels and under the assumption of single-channel sensing (K = 1), the
structure, optimality, and performance of the myopic policy have been investigated in [10],
where the semi-universal structure of the myopic policy was established for all N and the
optimality of the myopic policy proved for N = 2. In a recent work [11], the optimality of
the myopic policy was extended to N > 2 under the condition of $p_{11} > p_{01}$. These results
have also been extended to cases with probing errors in [19]. In this paper, we establish the 
equivalence relationship between the myopic policy and Whittle's index policy when channels 
are stochastically identical. This equivalency shows that the results obtained in [10], [11] for the 
myopic policy are directly applicable to Whittle's index policy. Furthermore, we extend these 
results to multichannel sensing (K > 1). 

Other examples of applying the general RMBP framework to communication systems include 
the work by Lott and Teneketzis [20] and the work by Raghunathan et al. [21]. In [20], the 
problem of multichannel allocation for single-hop mobile networks with multiple service classes 
was formulated as an RMBP, and sufficient conditions for the optimality of a myopic-type index 
policy were established. In [21], multicast scheduling in wireless broadcast systems with strict 
deadlines was formulated as an RMBP with a finite state space. The indexability was established 
and Whittle's index was obtained in closed-form. Recent work by Kleinberg gives interesting 
applications of bandit processes to Internet search and web advertisement placement [22]. 

In the general context of RMBP, there is a rich literature on indexability. See [23] for the 
linear programming representation of conditions for indexability and [9] for examples of specific 
indexable restless bandit processes. Constant-factor approximation algorithms for RMBP have 
also been explored in the literature. For the same class of RMBP as considered in this paper, 
Guha and Munagala [24] have developed a constant-factor (1/68) approximation via LP relaxation 
under the condition of $p_{11} > 1/2 > p_{01}$ for each channel. In [25], Guha et al. have developed a
factor-2 approximation policy via LP relaxation for the so-called monotone bandit processes. 

In [16], Le Ny et al. have considered the same class of RMBP motivated by the applications
of target tracking. They have independently established the indexability and obtained the closed-form
expressions for Whittle's index under the discounted reward criterion. Our approach to
establishing indexability and obtaining Whittle's index is, however, different from that used 
in [16], and the two approaches complement each other. Indeed, the fact that two completely 
different applications lead to the same class of RMBP lends support for a detailed investigation of 
this particular type of RMBP. We also include several results that were not considered in [16]. 
In particular, we consider both discounted and average reward criteria, develop algorithms
for and analyze the complexity of computing the optimal performance under the Lagrangian 
relaxation, and establish the semi-universal structure and the optimality of Whittle's index policy 
for stochastically identical arms. 



E. Organization 

The rest of the paper is organized as follows. In Sec. II, the RMBP formulation is presented.
In Sec. III, we introduce the basic concepts of indexability and Whittle's index. In Sec. IV,
we address the total discounted reward criterion, where we establish the indexability, obtain
Whittle's index in closed-form, and develop efficient algorithms for computing an upper bound
on the performance of the optimal policy. Simulation examples are provided to illustrate the
tightness of the upper bound and the near-optimal performance of Whittle's index policy. In
Sec. V, we consider the average reward criterion and obtain results parallel to those obtained
under the discounted reward criterion. In Sec. VI, we consider the special case when channels are
stochastically identical. We show that Whittle's index policy is optimal under certain conditions
and has a simple and robust structure. The approximation factor of Whittle's index policy is also
analyzed. Sec. VII concludes this paper.

II. Problem Statement and Restless Bandit Formulation 

A. Multi-channel Opportunistic Access 

Consider N independent Gilbert-Elliot channels, each with transmission rate $B_i$ ($i = 1, \dots, N$).
Without loss of generality, we normalize the maximum data rate: $\max_{i \in \{1, 2, \dots, N\}} B_i = 1$. The
state of channel i, "good" (1) or "bad" (0), evolves from slot to slot as a Markov chain with
transition matrix $\mathbf{P}_i = \{p^{(i)}_{j,k}\}_{j,k \in \{0,1\}}$ as shown in Fig. 1.

At the beginning of slot t, the user selects K out of N channels to sense. If the state $S_i(t)$
of the sensed channel i is 1, the user transmits and collects $B_i$ units of reward in this channel.
Otherwise, the user collects no reward in this channel. Let $U(t)$ denote the set of K channels
chosen in slot t. The reward obtained in slot t is thus given by

$$R_{U(t)}(t) = \sum_{i \in U(t)} S_i(t) B_i.$$

Our objective is to maximize the expected long-run reward by designing a sensing policy that 
sequentially selects K channels to sense in each slot. 

B. Restless Multi-armed Bandit Formulation 

The channel states $[S_1(t), \dots, S_N(t)] \in \{0, 1\}^N$ are not directly observable before the sensing
action is made. The user can, however, infer the channel states from its decision and observation
history. It has been shown that a sufficient statistic for optimal decision making is given by the
conditional probability that each channel is in state 1 given all past decisions and observations
[26]. Referred to as the belief vector or information state, this sufficient statistic is denoted by
$\Omega(t) = [\omega_1(t), \dots, \omega_N(t)]$, where $\omega_i(t)$ is the conditional probability that $S_i(t) = 1$. Given the
sensing action $U(t)$ and the observation in slot t, the belief state in slot t + 1 can be obtained
recursively as follows:



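$$\omega_i(t+1) = \begin{cases} p^{(i)}_{11}, & i \in U(t),\ S_i(t) = 1 \\ p^{(i)}_{01}, & i \in U(t),\ S_i(t) = 0 \\ \mathcal{T}(\omega_i(t)), & i \notin U(t) \end{cases} \tag{1}$$

where

$$\mathcal{T}(\omega_i(t)) \triangleq \omega_i(t)\, p^{(i)}_{11} + (1 - \omega_i(t))\, p^{(i)}_{01}$$

denotes the operator for the one-step belief update for unobserved channels.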
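
The belief recursion in (1) can be illustrated with a minimal sketch. The function names and the list/dict data layout below are assumptions made for illustration only; they are not part of the paper.

```python
def one_step_update(omega, p11, p01):
    """One-step belief update T(omega) for an unobserved channel."""
    return omega * p11 + (1.0 - omega) * p01


def update_beliefs(beliefs, sensed, observations, p11, p01):
    """Apply the recursion (1) to every channel.

    beliefs:      list of omega_i(t)
    sensed:       set of channel indices U(t)
    observations: dict mapping each sensed index i to the observed state S_i(t)
    p11, p01:     per-channel transition probabilities (lists)
    """
    new_beliefs = []
    for i, omega in enumerate(beliefs):
        if i in sensed:
            # A sensed channel reveals its state, so the belief resets.
            new_beliefs.append(p11[i] if observations[i] == 1 else p01[i])
        else:
            new_beliefs.append(one_step_update(omega, p11[i], p01[i]))
    return new_beliefs


def stationary_belief(p11, p01):
    """Stationary probability of the good state, p01 / (p01 + p10)."""
    p10 = 1.0 - p11
    return p01 / (p01 + p10)
```

The stationary belief is a fixed point of the one-step update, which is why it is a natural initial belief when no prior information is available.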

If no information on the initial system state is available, the i-th entry of the initial belief
vector can be set to the stationary distribution $\omega_o^{(i)}$ of the underlying Markov chain:

$$\omega_o^{(i)} = \frac{p^{(i)}_{01}}{p^{(i)}_{01} + p^{(i)}_{10}}.$$

It is now easy to see that we have an RMBP, where each channel is considered as an arm and
the state of arm i in slot t is the belief state $\omega_i(t)$. The user chooses an action $U(t)$ consisting
of K arms to activate (sense) in each slot, while other arms are made passive (unobserved). The
states of both active and passive arms change as given in (1).

A policy $\pi : \Omega(t) \to U(t)$ is a function that maps from the belief vector $\Omega(t)$ to the action $U(t)$
in slot t. Our objective is to design the optimal policy $\pi^*$ to maximize the expected long-term
reward.

There are two commonly used performance measures. One is the expected total discounted 
reward over the infinite horizon: 

$$E_\pi\Big[\sum_{t=1}^{\infty} \beta^{t-1} R_{\pi(\Omega(t))}(t) \,\Big|\, \Omega(1)\Big], \tag{3}$$

where $0 \le \beta < 1$ is the discount factor and $R_{\pi(\Omega(t))}(t)$ is the reward obtained in slot t under
action $U(t) = \pi(\Omega(t))$ determined by the policy $\pi$. This performance measure applies when
rewards in the future are less valuable, for example, in delay sensitive communication systems.
It also applies when the horizon length is a geometrically distributed random variable with
parameter $\beta$. For example, a communication session may end at a random time, and the user
aims to maximize the number of packets delivered before the session ends. 
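
The link between a geometric horizon and discounting can be made explicit with a short calculation. It is a sketch under two assumptions that the text does not state: the random horizon $T$ is independent of the reward process, and it satisfies $\Pr[T \ge t] = \beta^{t-1}$ (the session survives each slot with probability $\beta$). Here $R(t)$ is shorthand for the reward in slot t under a fixed policy.

```latex
\begin{align*}
E\Big[\sum_{t=1}^{T} R(t)\Big]
  &= E\Big[\sum_{t=1}^{\infty} \mathbf{1}\{T \ge t\}\, R(t)\Big] \\
  &= \sum_{t=1}^{\infty} \Pr[T \ge t]\; E[R(t)]
     && \text{(independence of $T$ and the rewards)} \\
  &= \sum_{t=1}^{\infty} \beta^{t-1} E[R(t)] .
\end{align*}
```

Under these assumptions, the expected total reward over the random horizon coincides with the discounted criterion in (3).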

The other performance measure is the expected average reward over the infinite horizon [27]: 

$$E_\pi\Big[\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} R_{\pi(\Omega(t))}(t) \,\Big|\, \Omega(1)\Big]. \tag{4}$$

This is the common measure of throughput in the context of communications. 

For notational convenience, let $(\Omega(1), \{\mathbf{P}_i\}_{i=1}^{N}, \{B_i\}_{i=1}^{N}, \beta)$ denote the RMBP with the discounted
reward criterion, and $(\Omega(1), \{\mathbf{P}_i\}_{i=1}^{N}, \{B_i\}_{i=1}^{N}, 1)$ the RMBP with the average reward
criterion.

III. Indexability and Index Policies

In this section, we introduce the basic concepts of indexability and Whittle's index policy. 

A. Index Policy 

An index policy assigns an index for each state of each arm to measure how rewarding it is 
to activate an arm at a particular state. In each slot, the policy activates those K arms whose 
current states have the largest indices. 

For a strongly decomposable index policy, the index of an arm only depends on the characteristics
(transition probabilities, reward structure, etc.) of this arm. Arms are thus decoupled when
computing the index, reducing an N-dimensional problem to N independent 1-dimensional
problems.

A myopic policy is a simple example of strongly decomposable index policies. This policy 
ignores the impact of the current action on the future reward, focusing solely on maximizing the 
expected immediate reward. The index is thus the expected immediate reward of activating an 
arm at a particular state. For the problem at hand, the myopic index of each state $\omega_i(t)$ of arm
i is simply $\omega_i(t) B_i$. The myopic action $U(t)$ under the belief state $\Omega(t) = [\omega_1(t), \dots, \omega_N(t)]$ is
given by

$$U(t) = \arg\max_{U(t)} \sum_{i \in U(t)} \omega_i(t) B_i. \tag{5}$$
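
The myopic rule (5) amounts to picking the K arms with the largest myopic indices. A minimal sketch follows; the function name and the tie-breaking behaviour (inherited from Python's stable sort) are illustrative assumptions, since the paper does not specify how ties are resolved.

```python
def myopic_action(beliefs, rates, k):
    """Return the set of k channel indices with the largest omega_i * B_i."""
    # Rank channels by their myopic index, largest first.
    ranked = sorted(
        range(len(beliefs)),
        key=lambda i: beliefs[i] * rates[i],
        reverse=True,
    )
    return set(ranked[:k])
```

For example, with beliefs `[0.9, 0.5, 0.2]` and rates `[0.5, 1.0, 1.0]`, the myopic indices are 0.45, 0.5, and 0.2, so sensing two channels selects channels 1 and 0.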

B. Indexability and Whittle's Index Policy 

To introduce indexability and Whittle's index, it suffices to consider a single arm due to the
strong decomposability of Whittle's index. Consider a single-armed bandit process (a single
channel) with transition probabilities $\{p_{j,k}\}_{j,k \in \{0,1\}}$ and bandwidth B (here we drop the channel
index for notational simplicity). In each slot, the user chooses one of two possible actions,
$u \in \{0 \text{ (passive)}, 1 \text{ (active)}\}$, to make the arm passive or active. An expected reward of $\omega B$
is obtained when the arm is activated at belief state $\omega$, and the belief state transits according
to (1). The objective is to decide whether to activate the arm in each slot to maximize the total
discounted or average reward. The optimal policy is essentially given by an optimal partition
of the state space [0, 1] into a passive set $\{\omega : u^*(\omega) = 0\}$ and an active set $\{\omega : u^*(\omega) = 1\}$,
where $u^*(\omega)$ denotes the optimal action under belief state $\omega$.

Whittle's index measures how attractive it is to activate an arm based on the concept of 
subsidy for passivity. Specifically, we construct a single-armed bandit process that is identical 
to the above specified bandit process except that a constant subsidy m is obtained whenever 
the arm is made passive. Obviously, this subsidy m will change the optimal partition of the 
passive and active sets, and states that remain in the active set under a larger subsidy m are 
more attractive to the user. The minimum subsidy m that is needed to move a state from the 
active set to the passive set under the optimal partition thus measures how attractive this state 
is. 

We now present the formal definition of indexability and Whittle's index. We consider the
discounted reward criterion. Their definitions under the average reward criterion can be similarly 
obtained. 

Denoted by $V_{\beta,m}(\omega)$, the value function represents the maximum expected total discounted
reward that can be accrued from a single-armed bandit process with subsidy m when the initial
belief state is $\omega$. Considering the two possible actions in the first slot, we have

$$V_{\beta,m}(\omega) = \max\{V_{\beta,m}(\omega; u = 0),\ V_{\beta,m}(\omega; u = 1)\}, \tag{6}$$

where $V_{\beta,m}(\omega; u)$ denotes the expected total discounted reward obtained by taking action u in
the first slot followed by the optimal policy in future slots. Consider $V_{\beta,m}(\omega; u = 0)$. It is given
by the sum of the subsidy m obtained in the first slot under the passive action and the total
discounted future reward $\beta V_{\beta,m}(\mathcal{T}(\omega))$, which is determined by the updated belief state $\mathcal{T}(\omega)$
(see (1)). $V_{\beta,m}(\omega; u = 1)$ can be similarly obtained, and we arrive at the following dynamic
programming equations.

$$V_{\beta,m}(\omega; u = 0) = m + \beta V_{\beta,m}(\mathcal{T}(\omega)), \tag{7}$$

$$V_{\beta,m}(\omega; u = 1) = \omega + \beta\big(\omega V_{\beta,m}(p_{11}) + (1 - \omega) V_{\beta,m}(p_{01})\big). \tag{8}$$

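The optimal action $u_m^*(\omega)$ for belief state $\omega$ under subsidy m is given by

$$u_m^*(\omega) = \begin{cases} 1, & \text{if } V_{\beta,m}(\omega; u = 1) > V_{\beta,m}(\omega; u = 0) \\ 0, & \text{otherwise} \end{cases} \tag{9}$$

The passive set $\mathcal{P}(m)$ under subsidy m is given by

$$\mathcal{P}(m) = \{\omega : u_m^*(\omega) = 0\} \tag{10}$$

$$\phantom{\mathcal{P}(m)} = \{\omega : V_{\beta,m}(\omega; u = 0) \ge V_{\beta,m}(\omega; u = 1)\}. \tag{11}$$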

Definition 1: An arm is indexable if the passive set $\mathcal{P}(m)$ of the corresponding single-armed
bandit process with subsidy m monotonically increases from $\emptyset$ to the whole state space [0, 1] as
m increases from $-\infty$ to $+\infty$. An RMBP is indexable if every arm is indexable.

Under the indexability condition, Whittle's index is defined as follows.

Definition 2: If an arm is indexable, its Whittle's index $W(\omega)$ of the state $\omega$ is the infimum
subsidy m such that it is optimal to make the arm passive at $\omega$. Equivalently, Whittle's index
$W(\omega)$ is the infimum subsidy m that makes the passive and active actions equally rewarding.

$$W(\omega) = \inf_m \{m : u_m^*(\omega) = 0\} \tag{12}$$

$$\phantom{W(\omega)} = \inf_m \{m : V_{\beta,m}(\omega; u = 0) = V_{\beta,m}(\omega; u = 1)\}. \tag{13}$$
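
Definition 2 can be explored numerically with a rough sketch that is not the closed-form approach developed in this paper: value iteration for (6) to (8) on a discretized belief grid, wrapped in a bisection over the subsidy m. The function names, the grid size, the linear interpolation, and the bisection bracket [0, 1] (plausible for B = 1, since any m below 0 favours activation and any m of at least 1 favours passivity) are all assumptions made for illustration.

```python
def _interp(values, x):
    """Linear interpolation of `values` sampled uniformly on [0, 1]."""
    n = len(values) - 1
    pos = min(max(x, 0.0), 1.0) * n
    i = min(int(pos), n - 1)
    frac = pos - i
    return values[i] * (1.0 - frac) + values[i + 1] * frac


def action_values(m, p11, p01, beta, n_grid=101, tol=1e-9, max_iter=5000):
    """Value iteration on a belief grid; returns (V(.; u=0), V(.; u=1)) with B = 1."""
    grid = [j / (n_grid - 1) for j in range(n_grid)]
    v = [0.0] * n_grid
    for _ in range(max_iter):
        v11, v01 = _interp(v, p11), _interp(v, p01)
        # Passive action, eq. (7): subsidy plus discounted value at T(omega).
        v_pas = [m + beta * _interp(v, w * p11 + (1.0 - w) * p01) for w in grid]
        # Active action, eq. (8): immediate reward plus discounted reset value.
        v_act = [w + beta * (w * v11 + (1.0 - w) * v01) for w in grid]
        v_new = [max(a, b) for a, b in zip(v_pas, v_act)]  # eq. (6)
        diff = max(abs(a - b) for a, b in zip(v_new, v))
        v = v_new
        if diff < tol:
            break
    return v_pas, v_act


def whittle_index_numeric(omega, p11, p01, beta, n_bisect=20):
    """Approximate the smallest subsidy that makes passivity optimal at omega."""
    lo, hi = 0.0, 1.0
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        v_pas, v_act = action_values(mid, p11, p01, beta)
        if _interp(v_pas, omega) >= _interp(v_act, omega):
            hi = mid  # passive already optimal: the index is at most mid
        else:
            lo = mid
    return hi
```

A simple sanity check is the memoryless case $p_{11} = p_{01}$: sensing then carries no information about the future, the two action values differ by exactly $m - \omega$, and the sketch returns an index close to $\omega$, which coincides with the myopic index for B = 1.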

In Fig. 2, we compare the performance (throughput) of the myopic policy, Whittle's index
policy, and the optimal policy for the RMBP formulated in Sec. II. We observe that Whittle's
index policy achieves a near-optimal performance while the myopic policy suffers from a 
significant performance loss. 




Fig. 2. The performance of Whittle's index policy ($K = 1$, $N = 7$, $\{p^{(i)}_{01}\}_{i=1}^{7} = \{0.8, 0.6, 0.4, 0.9, 0.8, 0.6, 0.7\}$, $\{p^{(i)}_{11}\}_{i=1}^{7} =
\{0.6, 0.4, 0.2, 0.2, 0.4, 0.1, 0.3\}$, and $\{B_i\}_{i=1}^{7} = \{0.4998, 0.6668, 1.0000, 0.6296, 0.5830, 0.8334, 0.6668\}$).



IV. Whittle's Index under Discounted Reward Criterion

In this section, we focus on the discounted reward criterion. We establish the indexability, 
obtain Whittle's index in closed-form, and develop efficient algorithms for computing an upper 
bound of the optimal performance to provide a benchmark for evaluating the performance of 
Whittle's index policy. 

A. Properties of Belief State Transition

To establish indexability and obtain Whittle's index, it suffices to consider the single-armed 
bandit process with subsidy m. Again, we drop the channel index from all notations and set
B = 1.



Fig. 3. The k-step belief update of an unobserved arm ($p_{11} > p_{01}$).

Fig. 4. The k-step belief update of an unobserved arm ($p_{11} < p_{01}$).



The following lemma establishes properties of belief state transition that reveal the basic 
structure of the RMBP considered in this paper. We resort often to these properties when deriving 
the main results. 

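Lemma 1: Let $\mathcal{T}^k(\omega(t)) \triangleq \Pr[S(t + k) = 1 \mid \omega(t)]$ ($k = 0, 1, 2, \dots$) denote the k-step belief
update of $\omega(t)$ when the arm is unobserved for k consecutive slots. We have

$$\mathcal{T}^k(\omega) = \frac{p_{01} - (p_{11} - p_{01})^k \big(p_{01} - (1 + p_{01} - p_{11})\,\omega\big)}{1 + p_{01} - p_{11}}, \tag{14}$$

$$\min\{p_{01}, p_{11}\} \le \mathcal{T}^k(\omega) \le \max\{p_{01}, p_{11}\}, \quad \forall\, \omega \in [0, 1],\ \forall\, k \ge 1. \tag{15}$$

Furthermore, the convergence of $\mathcal{T}^k(\omega)$ to the stationary distribution $\omega_o = \frac{p_{01}}{p_{01} + p_{10}}$ has the
following property.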

• Case 1: Positively correlated channel ($p_{11} > p_{01}$).

For any $\omega \in [0, 1]$, $\mathcal{T}^k(\omega)$ monotonically converges to $\omega_o$ as $k \to \infty$ (see Fig. 3).

• Case 2: Negatively correlated channel ($p_{11} < p_{01}$).

For any $\omega \in [0, 1]$, $\mathcal{T}^{2k}(\omega)$ and $\mathcal{T}^{2k+1}(\omega)$ converge, from opposite directions, to $\omega_o$ as
$k \to \infty$ (see Fig. 4).

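Proof: $\mathcal{T}^k(\omega) = \omega\, \mathcal{T}^k(1) + (1 - \omega)\, \mathcal{T}^k(0)$, where $\mathcal{T}^k(1) = \Pr[S(t + k) = 1 \mid S(t) = 1]$ is the
k-step transition probability from 1 to 1, and $\mathcal{T}^k(0) = \Pr[S(t + k) = 1 \mid S(t) = 0]$ is the k-step
transition probability from 0 to 1. From the eigen-decomposition of the transition matrix $\mathbf{P}$ (see
[28]), we have $\mathcal{T}^k(1) = \frac{p_{01} + (1 - p_{11})(p_{11} - p_{01})^k}{1 + p_{01} - p_{11}}$ and $\mathcal{T}^k(0) = \frac{p_{01}\,(1 - (p_{11} - p_{01})^k)}{1 + p_{01} - p_{11}}$, which leads to (14).

Other properties follow directly from (14). ■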
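
The closed form (14) can be checked against repeated application of the one-step update. The sketch below uses hypothetical function names and assumes $1 + p_{01} - p_{11} > 0$, that is, it excludes the degenerate chain with $p_{11} = 1$ and $p_{01} = 0$.

```python
def belief_k_step_closed_form(omega, k, p11, p01):
    """k-step belief update of an unobserved arm via the closed form (14)."""
    r = p11 - p01        # second eigenvalue of the transition matrix
    d = 1.0 + p01 - p11  # equals p01 + p10, assumed positive
    return (p01 - (r ** k) * (p01 - d * omega)) / d


def belief_k_step_iterative(omega, k, p11, p01):
    """k-step belief update obtained by applying T(.) k times."""
    for _ in range(k):
        omega = omega * p11 + (1.0 - omega) * p01
    return omega
```

For a negatively correlated arm the factor $(p_{11} - p_{01})^k$ alternates in sign, which produces the oscillating convergence described in Case 2.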

Next, we define an important quantity $L(\omega, \omega')$. Referred to as the crossing time, $L(\omega, \omega')$ is
the minimum amount of time required for a passive arm to transit across $\omega'$ starting from $\omega$.

$$L(\omega, \omega') \triangleq \min\{k : \mathcal{T}^k(\omega) > \omega'\}.$$

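For a positively correlated arm, we have, from Lemma 1,

$$L(\omega, \omega') = \begin{cases} 0, & \text{if } \omega > \omega' \\ \left\lfloor \log_{p_{11} - p_{01}} \dfrac{p_{01} - \omega'(1 - p_{11} + p_{01})}{p_{01} - \omega(1 - p_{11} + p_{01})} \right\rfloor + 1, & \text{if } \omega \le \omega' < \omega_o \\ \infty, & \text{if } \omega \le \omega' \text{ and } \omega' \ge \omega_o \end{cases} \tag{16}$$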



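For a negatively correlated arm, we have

$$L(\omega, \omega') = \begin{cases} 0, & \text{if } \omega > \omega' \\ 1, & \text{if } \omega \le \omega' \text{ and } \mathcal{T}(\omega) > \omega' \\ \infty, & \text{if } \omega \le \omega' \text{ and } \mathcal{T}(\omega) \le \omega' \end{cases} \tag{17}$$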
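
The crossing time in (16) and (17) can be compared with a direct search over k. This is a minimal sketch with hypothetical function names; the brute-force version caps the search at `max_k` slots (an assumption) and reports infinity if the threshold is not crossed by then.

```python
import math


def crossing_time_brute_force(omega, target, p11, p01, max_k=10_000):
    """Smallest k with T^k(omega) > target, found by iterating the update."""
    for k in range(max_k + 1):
        if omega > target:
            return k
        omega = omega * p11 + (1.0 - omega) * p01
    return math.inf


def crossing_time_closed_form(omega, target, p11, p01):
    """Crossing time from (16) for p11 > p01 and from (17) otherwise."""
    if omega > target:
        return 0
    if p11 > p01:
        d = 1.0 - p11 + p01
        omega_o = p01 / d  # stationary belief
        if target >= omega_o:
            return math.inf  # monotone convergence never exceeds omega_o
        ratio = (p01 - target * d) / (p01 - omega * d)
        return math.floor(math.log(ratio, p11 - p01)) + 1
    # Negatively correlated (or memoryless) arm: the first step decides.
    one_step = omega * p11 + (1.0 - omega) * p01
    return 1 if one_step > target else math.inf
```

With $p_{11} = 0.8$, $p_{01} = 0.2$, $\omega = 0.3$, and $\omega' = 0.45$, the beliefs evolve as 0.38, 0.428, 0.4568, so the threshold is first exceeded at k = 3, in agreement with (16). Inputs that sit exactly on a case boundary may be sensitive to floating-point rounding in the logarithm.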



¹ It is easy to show that $p_{11} > p_{01}$ corresponds to the case where the channel states in two consecutive slots are positively
correlated, i.e., for any distribution of $S(t)$, we have $E[(S(t) - E[S(t)])(S(t + 1) - E[S(t + 1)])] > 0$, where $S(t)$ is the
state of the Gilbert-Elliot channel in slot t. Similarly, $p_{11} < p_{01}$ corresponds to the case where $S(t)$ and $S(t + 1)$ are negatively
correlated, and $p_{11} = p_{01}$ the case where $S(t)$ and $S(t + 1)$ are independent.
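
The correlation claim in the footnote can be verified with a short calculation. The symbol $q \triangleq \Pr[S(t) = 1]$ is introduced here for illustration only and is assumed to satisfy $0 < q < 1$.

```latex
\begin{align*}
E[S(t)\,S(t+1)] &= q\, p_{11}, \qquad
E[S(t+1)] = q\, p_{11} + (1 - q)\, p_{01}, \\
\operatorname{Cov}\big(S(t), S(t+1)\big)
  &= q\, p_{11} - q\big(q\, p_{11} + (1 - q)\, p_{01}\big)
   = q(1 - q)\,(p_{11} - p_{01}).
\end{align*}
```

The sign of the covariance is therefore the sign of $p_{11} - p_{01}$, and the covariance vanishes when $p_{11} = p_{01}$.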

B. The Optimal Policy

In this subsection, we show that the optimal policy for the single-armed bandit process with 
subsidy m is a threshold policy. This threshold structure provides the key to establishing the 
indexability and solving for Whittle's index policy in closed-form as shown in Sec. IV-E.

This threshold structure is obtained by examining the value functions $V_{\beta,m}(\omega; u = 0)$ and
$V_{\beta,m}(\omega; u = 1)$ given in (7) and (8). From (8), we observe that $V_{\beta,m}(\omega; u = 1)$ is a linear function
of $\omega$. Following the general result on the convexity of the value function of a POMDP [29], we
conclude that $V_{\beta,m}(\omega; u = 0)$ given in (7) is convex in $\omega$. These properties of $V_{\beta,m}(\omega; u = 1)$
and $V_{\beta,m}(\omega; u = 0)$ lead to the lemma below.

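Lemma 2: The optimal policy for the single-armed bandit process with subsidy m is a threshold
policy, i.e., there exists an $\omega^*_\beta(m) \in \mathbb{R}$ such that

$$u_m^*(\omega) = \begin{cases} 1, & \text{if } \omega > \omega^*_\beta(m) \\ 0, & \text{if } \omega \le \omega^*_\beta(m) \end{cases}$$

and $V_{\beta,m}(\omega^*_\beta(m); u = 0) = V_{\beta,m}(\omega^*_\beta(m); u = 1)$.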







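[Figure: the two action-value functions plotted against $\omega$ on [0, 1], with the passive set to the left of their intersection and the active set to the right.]

Fig. 5. The optimality of a threshold policy ($0 \le m < 1$).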

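Proof: Consider first $0 \le m < 1$. We have the following inequalities regarding the end points
of $V_{\beta,m}(\omega; u = 1)$ and $V_{\beta,m}(\omega; u = 0)$ (see Fig. 5).

$$V_{\beta,m}(0; u = 1) = \beta V_{\beta,m}(p_{01}) \le m + \beta V_{\beta,m}(p_{01}) = V_{\beta,m}(0; u = 0), \tag{18}$$

$$V_{\beta,m}(1; u = 1) = 1 + \beta V_{\beta,m}(p_{11}) > m + \beta V_{\beta,m}(p_{11}) = V_{\beta,m}(1; u = 0). \tag{19}$$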

Fig. 6. The optimality of a threshold policy ($m \ge 1$).

Fig. 7. The optimality of a threshold policy ($m < 0$).



Since $V_{\beta,m}(\omega; u = 1)$ is linear in $\omega$ and $V_{\beta,m}(\omega; u = 0)$ is convex in $\omega$, $V_{\beta,m}(\omega; u = 1)$ and
$V_{\beta,m}(\omega; u = 0)$ must have one unique intersection at some point $\omega^*_\beta(m)$ as shown in Fig. 5.

When $m \ge 1$, it is optimal to make the arm passive all the time since the expected immediate
reward $\omega$ by activating the arm is uniformly upper bounded by 1 (see Fig. 6). We can thus
choose $\omega^*_\beta(m) = c$ for any $c > 1$.

When $m < 0$, we have (see Fig. 7)

$$V_{\beta,m}(0; u = 1) = \beta V_{\beta,m}(p_{01}) > m + \beta V_{\beta,m}(p_{01}) = V_{\beta,m}(0; u = 0), \tag{20}$$

$$V_{\beta,m}(1; u = 1) = 1 + \beta V_{\beta,m}(p_{11}) > m + \beta V_{\beta,m}(p_{11}) = V_{\beta,m}(1; u = 0). \tag{21}$$

Based on the convexity of $V_{\beta,m}(\omega; u = 0)$ in $\omega$, we have $V_{\beta,m}(\omega; u = 1) > V_{\beta,m}(\omega; u = 0)$
for any $\omega \in [0, 1]$. It is thus optimal to always activate the arm, and we can choose $\omega^*_\beta(m) = b$
for any $b < 0$. Lemma 2 thus follows. The expressions of $V_{\beta,m}(0; u = 1)$ and $V_{\beta,m}(0; u = 0)$
given in Fig. 6 and Fig. 7 are obtained from the closed-form expression of the value function,
which will be shown in the next subsection. ■



C. Closed-form Expression of The Value Function 

In this subsection, we obtain closed-form expressions for the value function $V_{\beta,m}(\omega)$. This
result is fundamental to calculating Whittle's index in closed-form and analyzing the performance 
of Whittle's index policy. 
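
Based on the threshold structure of the optimal policy, the value function $V_{\beta,m}(\omega)$ can be
expressed in terms of $V_{\beta,m}(\mathcal{T}^{t_0 - 1}(\omega); u = 1)$ for some $t_0 \in \mathbb{Z}^+ \cup \{\infty\}$, where $t_0 = L(\omega, \omega^*_\beta(m)) + 1$
is the index of the slot when the belief $\omega$ transits across the threshold $\omega^*_\beta(m)$ for the first time
(recall that $L(\omega, \omega^*_\beta(m))$ is the crossing time given in (16) and (17)). Specifically, in the first
$L(\omega, \omega^*_\beta(m))$ slots, the subsidy m is obtained in each slot. In slot $t_0 = L(\omega, \omega^*_\beta(m)) + 1$, the belief
state transits across the threshold $\omega^*_\beta(m)$ and the arm is activated. The total reward thereafter is
$V_{\beta,m}(\mathcal{T}^{L(\omega, \omega^*_\beta(m))}(\omega); u = 1)$. We thus have, considering the discount factor,

$$V_{\beta,m}(\omega) = \frac{1 - \beta^{L(\omega, \omega^*_\beta(m))}}{1 - \beta}\, m + \beta^{L(\omega, \omega^*_\beta(m))}\, V_{\beta,m}\big(\mathcal{T}^{L(\omega, \omega^*_\beta(m))}(\omega); u = 1\big). \tag{22}$$

When $L(\omega, \omega^*_\beta(m)) = \infty$, the second term vanishes and (22) reduces to $V_{\beta,m}(\omega) = \frac{m}{1 - \beta}$.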



Since Vg ,„(T^(u;); -u = 1) is a function of V0,m(Poi) and V/s^rniPn) as shown in ©, we only 
need to solve for Vam(poi) and Va,m(Pii)- Note that poi and pu are simply two specific values 
of uj; both Vg m(Poi) and Vfl m(pii) can be written as functions of themselves through (|22l) . We 
can thus solve for Vfl^m(Poi) and Va,m(Pii) as given in Lemma [H 
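The two ingredients of (22), the crossing time and the discounted subsidy collected while the arm is passive, can be sketched numerically as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the crossing time is found by iterating the one-step belief update $T(\omega) = \omega p_{11} + (1-\omega) p_{01}$ directly instead of using the closed forms (16) and (17), and the iteration cap `max_steps` is an assumption used to approximate $L = \infty$.

```python
import math

def belief_update(omega, p11, p01):
    """One-step belief update T(omega) of a passive (unobserved) arm."""
    return omega * p11 + (1.0 - omega) * p01

def crossing_time(omega, threshold, p11, p01, max_steps=10_000):
    """Smallest k >= 0 with T^k(omega) > threshold; math.inf if none within max_steps.

    The threshold itself belongs to the passive set, hence the strict inequality.
    max_steps is an assumed numerical stand-in for an infinite crossing time.
    """
    for k in range(max_steps):
        if omega > threshold:
            return k
        omega = belief_update(omega, p11, p01)
    return math.inf

def passive_reward(L, m, beta):
    """First term of (22): discounted subsidy collected over the first L passive slots."""
    if math.isinf(L):
        return m / (1.0 - beta)
    return (1.0 - beta ** L) / (1.0 - beta) * m
```

For example, with $p_{11} = 0.8$ and $p_{01} = 0.2$ the belief started at $p_{01}$ moves to 0.32 and then 0.392, so a threshold of 0.35 is first exceeded after two passive slots.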

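Lemma 3: Let $\omega^*_\beta(m)$ denote the threshold of the optimal policy for the single-armed bandit
process with subsidy $m$. The value functions $V_{\beta,m}(p_{01})$ and $V_{\beta,m}(p_{11})$ can be obtained in closed
form as given below.

• Case 1: Positively correlated channel ($p_{11} > p_{01}$)

$$V_{\beta,m}(p_{01}) = \begin{cases}
\dfrac{p_{01}}{(1-\beta)(1-\beta p_{11}+\beta p_{01})}, & \text{if } \omega^*_\beta(m) < p_{01} \\[2ex]
\dfrac{(1-\beta p_{11})\big(1-\beta^{L(p_{01},\omega^*_\beta(m))}\big)\, m + (1-\beta)\,\beta^{L(p_{01},\omega^*_\beta(m))}\, T^{L(p_{01},\omega^*_\beta(m))}(p_{01})}{(1-\beta p_{11})(1-\beta)\big(1-\beta^{L(p_{01},\omega^*_\beta(m))+1}\big) + (1-\beta)^2\,\beta^{L(p_{01},\omega^*_\beta(m))+1}\, T^{L(p_{01},\omega^*_\beta(m))}(p_{01})}, & \text{if } p_{01} \le \omega^*_\beta(m) < \omega_o \\[2ex]
\dfrac{m}{1-\beta}, & \text{if } \omega^*_\beta(m) \ge \omega_o
\end{cases} \tag{23}$$

$$V_{\beta,m}(p_{11}) = \begin{cases}
\dfrac{p_{11} + \beta(1-p_{11})\, V_{\beta,m}(p_{01})}{1-\beta p_{11}}, & \text{if } \omega^*_\beta(m) < p_{11} \\[2ex]
\dfrac{m}{1-\beta}, & \text{if } \omega^*_\beta(m) \ge p_{11}
\end{cases} \tag{24}$$

Note that $V_{\beta,m}(p_{01})$ is given explicitly in (23) while $V_{\beta,m}(p_{11})$ is given in terms of
$V_{\beta,m}(p_{01})$ for ease of presentation.

• Case 2: Negatively correlated channel ($p_{11} < p_{01}$)

$$V_{\beta,m}(p_{11}) = \begin{cases}
\dfrac{p_{11}(1-\beta)+\beta p_{01}}{(1-\beta)(1-\beta p_{11}+\beta p_{01})}, & \text{if } \omega^*_\beta(m) < p_{11} \\[2ex]
\dfrac{(1-\beta(1-p_{01}))\, m + \beta(1-\beta)\, T(p_{11}) + \beta^2 p_{01}}{1-\beta(1-p_{01}) - \beta^2 (1-\beta)\, T(p_{11}) - \beta^3 p_{01}}, & \text{if } p_{11} \le \omega^*_\beta(m) < T(p_{11}) \\[2ex]
\dfrac{m}{1-\beta}, & \text{if } \omega^*_\beta(m) \ge T(p_{11})
\end{cases} \tag{25}$$

$$V_{\beta,m}(p_{01}) = \begin{cases}
\dfrac{p_{01} + \beta p_{01}\, V_{\beta,m}(p_{11})}{1-\beta(1-p_{01})}, & \text{if } \omega^*_\beta(m) < p_{01} \\[2ex]
\dfrac{m}{1-\beta}, & \text{if } \omega^*_\beta(m) \ge p_{01}
\end{cases} \tag{26}$$

Note that $V_{\beta,m}(p_{11})$ is given explicitly in (25) while $V_{\beta,m}(p_{01})$ is given in terms of
$V_{\beta,m}(p_{11})$ for ease of presentation.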

Proof: The key to the closed-form expressions for $V_{\beta,m}(p_{01})$ and $V_{\beta,m}(p_{11})$ is finding the
first slot in which the optimal action is to activate the arm (i.e., the slot in which the belief
state crosses the threshold $\omega^*_\beta(m)$). This can be done by applying the transition properties of
the belief state given in Lemma 1. See Appendix A for the complete proof. ■

D. The Total Discounted Time of Being Passive 

In this subsection, we study the total discounted time that the single-armed bandit process
with subsidy $m$ is made passive. This quantity plays the central role in our proof of indexability
and in the algorithms for computing an upper bound on the optimal performance, as shown in
Sec. IV-E and Sec. IV-F.

Let $D_{\beta,m}(\omega)$ denote the total discounted time that the single-armed bandit process with subsidy
$m$ is made passive under the optimal policy when the initial belief state is $\omega$. It has been shown
by Whittle that $D_{\beta,m}(\omega)$ is the derivative of the value function $V_{\beta,m}(\omega)$ with respect to $m$ [3]:

$$D_{\beta,m}(\omega) = \frac{d\, V_{\beta,m}(\omega)}{dm}.$$

This result is intuitive: when the subsidy for passivity $m$ increases, the rate at which the total
discounted reward $V_{\beta,m}(\omega)$ increases is determined by how often the arm is made passive.

Based on the threshold structure of the optimal policy, we can obtain the following dynamic
programming equation for $D_{\beta,m}(\omega)$ similar to that for $V_{\beta,m}(\omega)$ given in (22).

$$D_{\beta,m}(\omega) = \frac{1-\beta^{L(\omega,\omega^*_\beta(m))}}{1-\beta} + \beta^{L(\omega,\omega^*_\beta(m))+1}\Big( T^{L(\omega,\omega^*_\beta(m))}(\omega)\, D_{\beta,m}(p_{11}) + \big(1 - T^{L(\omega,\omega^*_\beta(m))}(\omega)\big)\, D_{\beta,m}(p_{01}) \Big). \tag{27}$$

Specifically, the first term in (27) is the total discounted time of the first $L(\omega, \omega^*_\beta(m))$ slots
when the arm is made passive. In slot $L(\omega, \omega^*_\beta(m)) + 1$, the arm is activated. With probability
$T^{L(\omega,\omega^*_\beta(m))}(\omega)$, the channel is in the good state in this slot, and the total future discounted
passive time is $D_{\beta,m}(p_{11})$. With probability $1 - T^{L(\omega,\omega^*_\beta(m))}(\omega)$, the channel is in the bad state
in this slot, and the total future discounted passive time is $D_{\beta,m}(p_{01})$.

By considering $\omega = p_{01}$ and $\omega = p_{11}$, both $D_{\beta,m}(p_{01})$ and $D_{\beta,m}(p_{11})$ can be written as
functions of themselves through (27). We can thus solve for $D_{\beta,m}(p_{01})$ and $D_{\beta,m}(p_{11})$ as given
in Lemma 4.

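Lemma 4: Let $\omega^*_\beta(m)$ denote the threshold of the optimal policy for the single-armed bandit
process with subsidy $m$. The total discounted passive times $D_{\beta,m}(p_{01})$ and $D_{\beta,m}(p_{11})$ are given
as follows.

• Case 1: Positively correlated channel ($p_{11} > p_{01}$)

$$D_{\beta,m}(p_{01}) = \begin{cases}
0, & \text{if } \omega^*_\beta(m) < p_{01} \\[1ex]
\dfrac{(1-\beta p_{11})\big(1-\beta^{L(p_{01},\omega^*_\beta(m))}\big)}{(1-\beta p_{11})(1-\beta)\big(1-\beta^{L(p_{01},\omega^*_\beta(m))+1}\big) + (1-\beta)^2\,\beta^{L(p_{01},\omega^*_\beta(m))+1}\, T^{L(p_{01},\omega^*_\beta(m))}(p_{01})}, & \text{if } p_{01} \le \omega^*_\beta(m) < \omega_o \\[2ex]
\dfrac{1}{1-\beta}, & \text{if } \omega^*_\beta(m) \ge \omega_o
\end{cases} \tag{28}$$

$$D_{\beta,m}(p_{11}) = \begin{cases}
\dfrac{\beta(1-p_{11})\, D_{\beta,m}(p_{01})}{1-\beta p_{11}}, & \text{if } \omega^*_\beta(m) < p_{11} \\[2ex]
\dfrac{1}{1-\beta}, & \text{if } \omega^*_\beta(m) \ge p_{11}
\end{cases} \tag{29}$$

• Case 2: Negatively correlated channel ($p_{11} < p_{01}$)

$$D_{\beta,m}(p_{11}) = \begin{cases}
0, & \text{if } \omega^*_\beta(m) < p_{11} \\[1ex]
\dfrac{1-\beta(1-p_{01})}{1-\beta(1-p_{01}) - \beta^2\, T(p_{11})(1-\beta) - \beta^3 p_{01}}, & \text{if } p_{11} \le \omega^*_\beta(m) < T(p_{11}) \\[2ex]
\dfrac{1}{1-\beta}, & \text{if } \omega^*_\beta(m) \ge T(p_{11})
\end{cases} \tag{30}$$

$$D_{\beta,m}(p_{01}) = \begin{cases}
\dfrac{\beta p_{01}\, D_{\beta,m}(p_{11})}{1-\beta(1-p_{01})}, & \text{if } \omega^*_\beta(m) < p_{01} \\[2ex]
\dfrac{1}{1-\beta}, & \text{if } \omega^*_\beta(m) \ge p_{01}
\end{cases} \tag{31}$$

Proof: The process of solving for $D_{\beta,m}(p_{01})$ and $D_{\beta,m}(p_{11})$ is similar to that of solving for
$V_{\beta,m}(p_{01})$ and $V_{\beta,m}(p_{11})$. Details are omitted. $D_{\beta,m}(p_{01})$ and $D_{\beta,m}(p_{11})$ can also be obtained by
taking the derivatives of $V_{\beta,m}(p_{01})$ and $V_{\beta,m}(p_{11})$ with respect to $m$. ■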
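As a numerical cross-check of Lemma 4, the two instances of (27) at $\omega = p_{01}$ and $\omega = p_{11}$ can be solved as a 2x2 linear system for a given threshold. The sketch below is illustrative only: the function names are hypothetical, the crossing time is found by iterating $T(\omega) = \omega p_{11} + (1-\omega) p_{01}$, and the cap `max_steps` is an assumption standing in for an infinite crossing time.

```python
import math

def belief_update(w, p11, p01):
    """One-step belief update T(w) of a passive (unobserved) arm."""
    return w * p11 + (1.0 - w) * p01

def first_activation(w, threshold, p11, p01, max_steps=10_000):
    """Return (L, T^L(w)): number of passive slots before the belief exceeds the threshold."""
    for k in range(max_steps):
        if w > threshold:
            return k, w
        w = belief_update(w, p11, p01)
    return math.inf, 0.0  # the belief never crosses the threshold (numerically)

def passive_times(p11, p01, beta, threshold):
    """Solve (27) at w = p01 and w = p11 for (D(p01), D(p11)) under a threshold policy."""
    coeffs = []
    for w in (p01, p11):
        L, t = first_activation(w, threshold, p11, p01)
        if math.isinf(L):
            # Passive forever: D = 1 / (1 - beta), no coupling to the other unknown.
            coeffs.append((1.0 / (1.0 - beta), 0.0, 0.0))
        else:
            coeffs.append(((1.0 - beta ** L) / (1.0 - beta), beta ** (L + 1), t))
    (a0, b0, t0), (a1, b1, t1) = coeffs
    # Linear system in x = D(p01), y = D(p11):
    #   (1 - b0 (1 - t0)) x - b0 t0 y = a0
    #   -b1 (1 - t1) x + (1 - b1 t1) y = a1
    m00, m01 = 1.0 - b0 * (1.0 - t0), -b0 * t0
    m10, m11 = -b1 * (1.0 - t1), 1.0 - b1 * t1
    det = m00 * m11 - m01 * m10
    return (a0 * m11 - m01 * a1) / det, (m00 * a1 - m10 * a0) / det
```

Solving the system numerically avoids case distinctions on where the threshold lies, at the cost of hiding the piecewise structure that the closed forms make explicit.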

We point out that $V_{\beta,m}(\omega)$ is not differentiable in $m$ at every point (i.e., the left derivative
may not equal the right derivative). Suppose that $V_{\beta,m}(\omega)$ is not differentiable at $m_0$. Then it
can be shown that the left derivative at $m_0$ corresponds to the case when the threshold $\omega^*_\beta(m_0)$
is included in the active set while the right derivative corresponds to the case when $\omega^*_\beta(m_0)$ is
included in the passive set. In this paper, we include the threshold in the passive set,
i.e., we choose the passive action when both actions are optimal. As a consequence, we consider
the right derivative of $V_{\beta,m}(\omega)$ when it is not differentiable.

The following lemma shows the piecewise constant (a stair function) and monotonically
increasing properties of $D_{\beta,m}(\omega)$ as a function of $m$. These properties allow us to develop
an efficient algorithm for computing a performance upper bound as shown in Sec. IV-F.

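Lemma 5: The total discounted passive time $D_{\beta,m}(\omega)$ as a function of $m$ is monotonically
increasing and piecewise constant (with countable pieces for $p_{11} > p_{01}$ and finite pieces for
$p_{11} < p_{01}$). Equivalently, the value function $V_{\beta,m}(\omega)$ is piecewise linear and convex in $m$.

Proof: The piecewise constant property follows directly from (27) and Lemma 4 and is
illustrated in Fig. 10 and Fig. 11. The monotonicity of $D_{\beta,m}(\omega)$ applies to a general restless bandit
and has been stated without proof by Whittle [3]. We provide a proof below for completeness.

We show that $V_{\beta,m}(\omega)$ is convex in $m$, i.e., for any $0 < \alpha < 1$ and $m_1, m_2 \in \mathbb{R}$,

$$\alpha V_{\beta,m_1}(\omega) + (1-\alpha) V_{\beta,m_2}(\omega) \ge V_{\beta,\alpha m_1 + (1-\alpha) m_2}(\omega). \tag{32}$$

Consider the optimal policy $\pi$ under subsidy $\alpha m_1 + (1-\alpha) m_2$. If we apply $\pi$ to the system
with subsidy $m_1$, the total discounted reward will be

$$V_{\beta,\alpha m_1 + (1-\alpha) m_2}(\omega) + D_{\beta,\alpha m_1 + (1-\alpha) m_2}(\omega)\,\big((1-\alpha)(m_1 - m_2)\big).$$

Since $\pi$ may not be the optimal policy under subsidy $m_1$, we have

$$V_{\beta,m_1}(\omega) \ge V_{\beta,\alpha m_1 + (1-\alpha) m_2}(\omega) + D_{\beta,\alpha m_1 + (1-\alpha) m_2}(\omega)\,\big((1-\alpha)(m_1 - m_2)\big). \tag{33}$$

Similarly,

$$V_{\beta,m_2}(\omega) \ge V_{\beta,\alpha m_1 + (1-\alpha) m_2}(\omega) + D_{\beta,\alpha m_1 + (1-\alpha) m_2}(\omega)\,\big(\alpha(m_2 - m_1)\big). \tag{34}$$

Multiplying (33) by $\alpha$ and (34) by $1-\alpha$ and adding the two inequalities, the terms involving
$D_{\beta,\alpha m_1 + (1-\alpha) m_2}(\omega)$ cancel. (32) thus follows from (33) and (34). ■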

E. Indexability and Whittle's Index Policy 

With the threshold structure of the optimal policy and the closed-form expressions of the value 
function and discounted passive time, we are ready to establish the indexability and solve for 
Whittle's index. 

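Theorem 1: The restless multi-armed bandit process $(\Omega(1), \{P_i\}_{i=1}^N, \{B_i\}_{i=1}^N, \beta)$ is indexable.

Proof: The proof is based on Lemma 2 and Lemma 4. Details are given in Appendix B. ■

Theorem 2: Whittle's index $W_\beta(\omega) \in \mathbb{R}$ for arm $i$ of the RMBP
$(\Omega(1), \{P_i\}_{i=1}^N, \{B_i\}_{i=1}^N, \beta)$ is given as follows.

• Case 1: Positively correlated channel ($p_{11}^{(i)} > p_{01}^{(i)}$).

$$W_\beta(\omega) = \begin{cases}
\omega B_i, & \text{if } \omega \le p_{01}^{(i)} \text{ or } \omega \ge p_{11}^{(i)} \\[1ex]
\dfrac{\omega}{1-\beta p_{11}^{(i)} + \beta\omega}\, B_i, & \text{if } \omega_o^{(i)} \le \omega < p_{11}^{(i)} \\[2ex]
\dfrac{(\omega - \beta T^1(\omega)) + C_2 (1-\beta)\big(\beta(1-\beta p_{11}^{(i)}) - \beta(\omega - \beta T^1(\omega))\big)}{1 - \beta p_{11}^{(i)} - C_1\big(\beta(1-\beta p_{11}^{(i)}) - \beta(\omega - \beta T^1(\omega))\big)}\, B_i, & \text{if } p_{01}^{(i)} < \omega < \omega_o^{(i)}
\end{cases} \tag{35}$$

where

$$C_1 = \frac{(1-\beta p_{11}^{(i)})\big(1-\beta^{L(p_{01}^{(i)},\omega)}\big)}{(1-\beta p_{11}^{(i)})\big(1-\beta^{L(p_{01}^{(i)},\omega)+1}\big) + (1-\beta)\,\beta^{L(p_{01}^{(i)},\omega)+1}\, T^{L(p_{01}^{(i)},\omega)}(p_{01}^{(i)})}, \qquad
C_2 = \frac{\beta^{L(p_{01}^{(i)},\omega)}\, T^{L(p_{01}^{(i)},\omega)}(p_{01}^{(i)})}{(1-\beta p_{11}^{(i)})\big(1-\beta^{L(p_{01}^{(i)},\omega)+1}\big) + (1-\beta)\,\beta^{L(p_{01}^{(i)},\omega)+1}\, T^{L(p_{01}^{(i)},\omega)}(p_{01}^{(i)})}.$$

• Case 2: Negatively correlated channel ($p_{11}^{(i)} < p_{01}^{(i)}$).

$$W_\beta(\omega) = \begin{cases}
\omega B_i, & \text{if } \omega \le p_{11}^{(i)} \text{ or } \omega \ge p_{01}^{(i)} \\[1ex]
\dfrac{\beta p_{01}^{(i)} + \omega(1-\beta)}{1 + \beta(p_{01}^{(i)} - \omega)}\, B_i, & \text{if } T^1(p_{11}^{(i)}) \le \omega < p_{01}^{(i)} \\[2ex]
\dfrac{(1-\beta+\beta C_4)\big(\beta p_{01}^{(i)} + \omega(1-\beta)\big)}{1 - \beta(1-p_{01}^{(i)}) - C_3\big(\beta^2 p_{01}^{(i)} + \beta\omega - \beta^2\omega\big)}\, B_i, & \text{if } \omega_o^{(i)} \le \omega < T^1(p_{11}^{(i)}) \\[2ex]
\dfrac{(1-\beta)\big(\beta p_{01}^{(i)} + \omega - \beta T^1(\omega)\big) - C_4\,\beta\big(\beta T^1(\omega) - \beta p_{01}^{(i)} - \omega\big)}{1 - \beta(1-p_{01}^{(i)}) + C_3\,\beta\big(\beta T^1(\omega) - \beta p_{01}^{(i)} - \omega\big)}\, B_i, & \text{if } p_{11}^{(i)} < \omega < \omega_o^{(i)}
\end{cases} \tag{36}$$

where, writing $V_{\beta,m}(p_{11}^{(i)}) = (C_3 m + C_4)/(1-\beta)$ for the middle case of (25),

$$C_3 = \frac{1-\beta(1-p_{01}^{(i)})}{1 + \beta(1+\beta)\, p_{01}^{(i)} - \beta^2\, T^1(p_{11}^{(i)})}, \qquad
C_4 = \frac{\beta(1-\beta)\, T^1(p_{11}^{(i)}) + \beta^2 p_{01}^{(i)}}{1 + \beta(1+\beta)\, p_{01}^{(i)} - \beta^2\, T^1(p_{11}^{(i)})}.$$

Proof: By the definition of Whittle's index, for a given belief state $\omega$, its Whittle's index
is the subsidy $m$ that is the solution to the following equation of $m$:

$$\underbrace{\omega + \beta\big(\omega V_{\beta,m}(p_{11}) + (1-\omega) V_{\beta,m}(p_{01})\big)}_{V_{\beta,m}(\omega;\, u=1)} = \underbrace{m + \beta V_{\beta,m}(T^1(\omega))}_{V_{\beta,m}(\omega;\, u=0)}. \tag{37}$$

From the closed-form expressions for $V_{\beta,m}(p_{11})$, $V_{\beta,m}(p_{01})$, and $V_{\beta,m}(T^1(\omega))$ given in Lemma 3,
we can solve (37) and obtain Whittle's index. ■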
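The closed-form index of Theorem 2 can be evaluated directly, as in the following sketch. It is illustrative only and rests on stated assumptions: the function name is hypothetical, the crossing time $L(p_{01}, \omega)$ is found by iterating $T^1(\omega) = \omega p_{11} + (1-\omega) p_{01}$ with an assumed cap `max_steps`, the arm is assumed to have no absorbing states (so that $1 + p_{01} - p_{11} > 0$), and the constants `c3`, `c4` are written in the form obtained by solving the middle case of (25) for $V_{\beta,m}(p_{11}) = (C_3 m + C_4)/(1-\beta)$.

```python
def whittle_index_discounted(w, p11, p01, beta, B=1.0, max_steps=100_000):
    """Whittle's index of belief w under the discounted criterion (sketch of (35)-(36))."""
    T = lambda x: x * p11 + (1.0 - x) * p01      # one-step belief update
    w_o = p01 / (1.0 + p01 - p11)                # stationary belief; assumes no absorbing state
    if p11 >= p01:                               # positively correlated channel
        if w <= p01 or w >= p11:
            return w * B
        if w >= w_o:
            return w / (1.0 - beta * p11 + beta * w) * B
        # p01 < w < w_o: L = L(p01, w), t = T^L(p01)
        L, t = 0, p01
        while t <= w and L < max_steps:
            t, L = T(t), L + 1
        den = (1 - beta * p11) * (1 - beta ** (L + 1)) + (1 - beta) * beta ** (L + 1) * t
        c1 = (1 - beta * p11) * (1 - beta ** L) / den
        c2 = beta ** L * t / den
        x = w - beta * T(w)
        g = beta * (1 - beta * p11) - beta * x
        return (x + c2 * (1 - beta) * g) / (1 - beta * p11 - c1 * g) * B
    # negatively correlated channel
    t1 = T(p11)
    if w <= p11 or w >= p01:
        return w * B
    if w >= t1:
        return (beta * p01 + w * (1 - beta)) / (1 + beta * (p01 - w)) * B
    d = 1 + beta * (1 + beta) * p01 - beta ** 2 * t1
    c3 = (1 - beta * (1 - p01)) / d
    c4 = (beta * (1 - beta) * t1 + beta ** 2 * p01) / d
    if w >= w_o:
        num = (1 - beta + beta * c4) * (beta * p01 + w * (1 - beta))
        return num / (1 - beta * (1 - p01) - c3 * (beta ** 2 * p01 + beta * w - beta ** 2 * w)) * B
    y = beta * T(w) - beta * p01 - w
    num = (1 - beta) * (beta * p01 + w - beta * T(w)) - c4 * beta * y
    return num / (1 - beta * (1 - p01) + c3 * beta * y) * B
```

A useful sanity check on any such implementation is that the index is continuous across the region boundaries and nondecreasing in the belief, which is what the equivalence with the myopic policy for identical arms relies on.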



The following properties of Whittle's index $W_\beta(\omega)$ follow from Theorem 1 and Theorem 2.
Corollary 1: Properties of Whittle's Index

• $W_\beta(\omega)$ is a monotonically increasing function of $\omega$. As a consequence, Whittle's index
policy is equivalent to the myopic policy for stochastically identical arms.

• For a positively correlated channel ($p_{11} > p_{01}$), $W_\beta(\omega)$ is piecewise concave with countable
pieces. More specifically, $W_\beta(\omega)$ is linear in $[0, p_{01}]$ and $[p_{11}, 1]$, concave in $[\omega_o, p_{11})$, and
piecewise concave with countable pieces in $(p_{01}, \omega_o)$ (see Fig. 8, left).

• For a negatively correlated channel ($p_{11} < p_{01}$), $W_\beta(\omega)$ is piecewise convex with finite
pieces. More specifically, $W_\beta(\omega)$ is linear in $[0, p_{11}]$ and $[p_{01}, 1]$, and convex in $(p_{11}, \omega_o)$,
$[\omega_o, T(p_{11}))$, and $[T(p_{11}), p_{01})$ (see Fig. 8, right).

The equivalency between Whittle's index policy and the myopic policy is particularly important.
It allows us to establish the structure and optimality of Whittle's index policy by examining
the myopic policy, which has a very simple index form.

Note that the region $(p_{01}, \omega_o)$ for a positively correlated arm is the most complex. The infinite
but countable concave pieces of Whittle's index in this region correspond to each possible value
of the crossing time $L(p_{01}, \omega) \in \{1, 2, \cdots\}$. This region presents most of the difficulties in
analyzing the performance of Whittle's index policy as shown in the next subsection.



Fig. 8. Whittle's index as a function of the belief $\omega$ (left: $p_{11} = 0.8$, $p_{01} = 0.2$, $\beta = 0.9$; right: $p_{11} = 0.4$, $p_{01} = 0.8$, $\beta = 0.9$).



F. Performance of Whittle's Index Policy

1) The Optimality of Whittle's Index Policy under a Relaxed Constraint: Whittle's index policy
is the optimal solution to a Lagrangian relaxation of RMBP [3]. Specifically, the number of
activated arms can vary over time provided that its discounted average over the infinite horizon
equals $K$. Let $K(t)$ denote the number of arms activated in slot $t$. The relaxed constraint is
given by

$$E_\pi\Big[(1-\beta)\sum_{t=1}^{\infty} \beta^{t-1} K(t)\Big] = K. \tag{38}$$

Let $\bar{V}_\beta(\Omega(1))$ denote the maximum expected total discounted reward that can be obtained under
this relaxed constraint when the initial belief vector is $\Omega(1)$. Based on the Lagrangian multiplier
theorem, we have [3]

$$\bar{V}_\beta(\Omega(1)) = \inf_m \Big\{ \sum_{i=1}^N V^{(i)}_{\beta,m}(\omega_i(1)) - m\,\frac{N-K}{1-\beta} \Big\}, \tag{39}$$

where $V^{(i)}_{\beta,m}(\omega)$ is the value function of the single-armed bandit process with subsidy $m$ that
corresponds to the $i$-th channel.

The above equation reveals the role of the subsidy $m$ as the Lagrangian multiplier and the
optimality of Whittle's index policy for RMBP under the relaxed constraint given in (38).
Specifically, under the relaxed constraint, Whittle's index policy is implemented by activating,
in each slot, those arms whose current states have a Whittle's index greater than a constant $m^*$.
This constant $m^*$ is the Lagrangian multiplier that makes the relaxed constraint given in (38)
satisfied, or equivalently, the Lagrangian multiplier that achieves the infimum in (39). It is not
difficult to see that Whittle's index policy implemented by comparing to a constant $m^*$ is the
optimal policy (i.e., achieves $\bar{V}_\beta(\Omega(1))$) for RMBP under the relaxed constraint.

2) An Upper Bound of the Optimal Performance: Under the strict constraint of $K(t) = K$
for all $t$, Whittle's index policy is implemented by activating those $K$ arms with the largest
indices in each slot. Its optimality is lost in general.
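Under the strict constraint, one slot of Whittle's index policy reduces to a top-$K$ selection over the current index values. A minimal sketch follows; the function name and the tie-breaking rule (lower arm position first) are assumptions made for illustration and are not specified in the text.

```python
def select_arms(indices, K):
    """Return the positions of the K largest index values.

    Ties are broken in favor of the lower position, which is an assumed convention.
    """
    order = sorted(range(len(indices)), key=lambda i: (-indices[i], i))
    return sorted(order[:K])
```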

Let $V_\beta(\Omega(1))$ denote the maximum expected total discounted reward of the RMBP under the
strict constraint that $K(t) = K$ for all $t$. It is obvious that

$$V_\beta(\Omega(1)) \le \bar{V}_\beta(\Omega(1)).$$

$\bar{V}_\beta(\Omega(1))$ thus provides a performance benchmark for all RMBP policies, including Whittle's
index policy. Unfortunately, $\bar{V}_\beta(\Omega(1))$ as given in (39) is, in general, difficult to obtain due to the
complexity of calculating the value functions of all arms and searching for the infimum over an
uncountable space. For the problem at hand, however, we have obtained $V^{(i)}_{\beta,m}(\omega_i(1))$ in closed
form as given in Lemma 3. Furthermore, the piecewise constant structure of the discounted
passive time $D^{(i)}_{\beta,m}(\omega_i(1))$ given in Lemma 5 leads to efficient algorithms for searching for the
infimum of the value functions over $m$ as shown below.

Let

$$G_\beta(\Omega(1), m) = \sum_{i=1}^N V^{(i)}_{\beta,m}(\omega_i(1)) - m\,\frac{N-K}{1-\beta}.$$

We then have $\bar{V}_\beta(\Omega(1)) = \inf_m G_\beta(\Omega(1), m)$. From Lemma 5, it is easy to see that $G_\beta(\Omega(1), m)$
is convex in $m$ as illustrated in Fig. 9. The infimum of $G_\beta(\Omega(1), m)$ is achieved at $m^*$ at which
the derivative of $G_\beta(\Omega(1), m)$ with respect to $m$ becomes nonnegative for the first time (note that
$G_\beta(\Omega(1), m)$ is not differentiable at every $m$, and we consider the right derivative when it is not
differentiable). Equivalently,

$$m^* = \sup\Big\{ m : \frac{\partial G_\beta(\Omega(1), m)}{\partial m} = \sum_{i=1}^N D^{(i)}_{\beta,m}(\omega_i(1)) - \frac{N-K}{1-\beta} < 0 \Big\}.$$

From Lemma 5, $D^{(i)}_{\beta,m}(\omega_i(1))$ is piecewise constant for each channel (see Fig. 10 and Fig. 11).
We can thus partition the range of $m$ into disjoint regions such that $\partial G_\beta(\Omega(1), m)/\partial m$ is constant
in each region. To obtain $m^*$, we only need to check each region successively until
$\partial G_\beta(\Omega(1), m)/\partial m$ becomes nonnegative for the first time (due to the monotonically increasing
property of $D^{(i)}_{\beta,m}(\omega_i(1))$ in $m$). The difficulty is that for a positively correlated channel, there are
infinitely many constant regions of $D^{(i)}_{\beta,m}(\omega_i(1))$ (see Fig. 11). However, we can find an arbitrarily
small interval $(W_\beta(\omega), W_\beta(\omega_o)]$, referred to as the gray area, outside which there are only a
finite number of constant regions of $D^{(i)}_{\beta,m}(\omega_i(1))$. By setting the gray area for each positively
correlated channel small enough, we can find an $m'$ that is arbitrarily close to $m^*$ so that
$G_\beta(\Omega(1), m') - G_\beta(\Omega(1), m^*) \le \epsilon$ for any $\epsilon > 0$. Specifically, we set the length of the gray area
for each positively correlated channel to $\delta/N$ (i.e., $W_\beta(\omega_o) - W_\beta(\omega) \le \delta/N$) where
$\delta = \epsilon(1-\beta)/K$. The total length of the gray area over all channels is thus at most $\delta$, i.e.,
$m' - m^* \le \delta$. Based on the convexity of $G_\beta(\Omega(1), m)$, the maximum derivative of $G_\beta(\Omega(1), m)$
for $m^* \le m \le 1$ is achieved at $m = 1$, which is equal to $N/(1-\beta) - (N-K)/(1-\beta) = K/(1-\beta)$.
Thus, we have

$$G_\beta(\Omega(1), m') - G_\beta(\Omega(1), m^*) \le \frac{K}{1-\beta}\,(m' - m^*) \le \frac{\delta K}{1-\beta} = \epsilon.$$

Fig. 10. The passive time for different regions ($p_{11} < p_{01}$).

Fig. 11. The passive time for different regions ($p_{11} > p_{01}$); the gray area contains infinitely many pieces.

We point out that if $m^*$ does not fall into the gray area, the algorithm will obtain $m^*$ and
$\bar{V}_\beta(\Omega(1))$ without error. In the special case when every channel is negatively correlated, the
algorithm will always output the exact value of $m^*$ and $\bar{V}_\beta(\Omega(1))$. The detailed algorithm is
given in Fig. 12. The complexity of this algorithm is given in the following theorem.

Computing the Performance Upper Bound within $\epsilon$-Accuracy

Input an $\epsilon > 0$. Set $\delta = \epsilon(1-\beta)/K$ and $j = 0$.

1) For each negatively correlated channel $i$, calculate $W_\beta(p_{11}^{(i)})$, $W_\beta(p_{01}^{(i)})$, and $W_\beta(T(p_{11}^{(i)}))$. If
$\omega_i(1) < \omega_o^{(i)}$, calculate $W_\beta(\omega_i(1))$ and $W_\beta(T^1(\omega_i(1)))$; otherwise only calculate $W_\beta(\omega_i(1))$.

2) For each positively correlated channel $i$, calculate $W_\beta(p_{01}^{(i)})$, $W_\beta(p_{11}^{(i)})$, and $W_\beta(\omega_o^{(i)})$.
Search for an $\omega^{(i)} < \omega_o^{(i)}$, within a small interval to the left of $\omega_o^{(i)}$, such that
$W_\beta(\omega^{(i)}) > W_\beta(\omega_o^{(i)}) - \delta/N$. Let $k_i$ be the
smallest integer such that $T^{k_i}(p_{01}^{(i)}) > \omega^{(i)}$. Calculate $W_\beta(T^k(p_{01}^{(i)}))$ for all $1 \le k < k_i$. If
$\omega_i(1) < \omega_o^{(i)}$, then let $d_i$ be the smallest integer such that $T^{d_i}(\omega_i(1)) > \omega^{(i)}$ and calculate
$W_\beta(T^k(\omega_i(1)))$ for all $1 \le k < d_i$; otherwise only calculate $W_\beta(\omega_i(1))$. Set the gray area
to $(W_\beta(\omega^{(i)}), W_\beta(\omega_o^{(i)})]$.

3) Order all Whittle's indices calculated in Steps 1 and 2 in ascending order. Let $[a_1, \ldots, a_h]$
denote the ordered Whittle's indices. Set $a_0 = 0$ and $a_{h+1} = 1$.

4) If $j \le h$, calculate $D = \sum_{k=1}^N D^{(k)}_{\beta,m}(\omega_k(1)) - \frac{N-K}{1-\beta}$ for $m \in [a_j, a_{j+1})$ according
to (27) (note that every $D^{(k)}_{\beta,m}(\omega_k(1))$ is constant for $m \in [a_j, a_{j+1})$). If $D$ is nonnegative,
go to Step 5; otherwise set $j = j + 1$ and repeat Step 4.

5) Calculate $G = G_\beta(\Omega(1), m)$ when $m \in [a_j, a_{j+1})$ according to (22). Output $m' = a_j$ and
$G$.

Fig. 12. Algorithm for computing the upper bound of the optimal performance.



Theorem 3: For any $\epsilon > 0$, the algorithm given in Fig. 12 runs in at most $O(N^2 \log N)$ time
to output a value $G$ that is within $\epsilon$ of $\bar{V}_\beta(\Omega(1))$.

Proof: See Appendix C. ■

To find the infimum of $G_\beta(\Omega(1), m)$, we can also carry out a binary search on the subsidy $m$.
It can be shown that this algorithm runs in $O(N(\log N)^2)$ time. However, it cannot output the
exact value of $m^*$ and $\bar{V}_\beta(\Omega(1))$.
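The binary search mentioned above can be sketched generically: because the right derivative of a convex function is nondecreasing in $m$, the supremum of the subsidies with a negative derivative can be bracketed by bisection. The function name, the interval $[0, 1]$, and the tolerance are assumptions for illustration; the callable `right_derivative` stands in for $\sum_i D^{(i)}_{\beta,m}(\omega_i(1)) - (N-K)/(1-\beta)$ and is not computed here.

```python
def bisect_subsidy(right_derivative, lo=0.0, hi=1.0, tol=1e-6):
    """Approximate m* = sup{m : right_derivative(m) < 0} for a nondecreasing derivative.

    Assumes right_derivative(lo) < 0 <= right_derivative(hi); the returned value
    is within tol above m*.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if right_derivative(mid) < 0:
            lo = mid   # m* lies to the right of mid
        else:
            hi = mid   # derivative already nonnegative at mid
    return hi
```

As the text notes, such a search only brackets $m^*$ to within the chosen tolerance, whereas the region-by-region search of Fig. 12 can return it exactly when $m^*$ lies outside the gray area.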

Fig. [13] shows an example of the performance of Whittle's index policy. It demonstrates the 
near optimal performance of Whittle's index policy and the tightness of the performance upper 
bound. 







Fig. 13. The Performance of Whittle's index policy (iV = 8, {pi'ijLi = {0-2, 0.5, 0.8, 0.1, 0.6, 0.2, 0.3, 0.8}, = 
{0.4, 0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.6}, = 1 for i = 1, . . . , 8, and /3 = 0.8). 

V. Whittle's Index under Average Reward Criterion

In this section, we investigate Whittle's index policy under the average reward criterion and
establish results parallel to those obtained under the discounted reward criterion in Sec. IV.

A. The Value Function and The Optimal Policy 

First, we present a general result by Dutta [30] on the relationship between the value function
and the optimal policy under the total discounted reward criterion and those under the average
reward criterion. This result allows us to study Whittle's index policy under the average reward
criterion by examining its limiting behavior as the discount factor $\beta \to 1$.

Dutta's Theorem [30]. Let $\mathcal{F}$ be the belief space of a POMDP and $V_\beta(\Omega)$ the value function
with discount factor $\beta$ for belief $\Omega \in \mathcal{F}$. The POMDP satisfies the value boundedness condition
if there exist a belief $\Omega'$, a real-valued function $c_1(\Omega): \mathcal{F} \to \mathbb{R}$, and a constant $c_2 < \infty$ such
that

$$c_1(\Omega) \le V_\beta(\Omega) - V_\beta(\Omega') \le c_2,$$

for any $\Omega \in \mathcal{F}$ and $\beta \in [0, 1)$. Under the value-boundedness condition, if a series of optimal
policies $\pi_{\beta_k}$ for a POMDP with discount factor $\beta_k$ pointwise converges to a limit $\pi^*$ as $\beta_k \to 1$,
then $\pi^*$ is the optimal policy for the POMDP under the average reward criterion. Furthermore,
let $J(\Omega)$ denote the maximum expected average reward over the infinite horizon starting from
the initial belief $\Omega$. We have

$$J(\Omega) = \lim_{\beta_k \to 1} (1-\beta_k)\, V_{\beta_k}(\Omega)$$

and $J(\Omega) = J$ is independent of the initial belief $\Omega$.

Next, we will show that the single-armed bandit process with subsidy $m$ under the discounted
reward criterion (see Sec. III-B) satisfies the value-boundedness condition.

Lemma 6: The single-armed bandit process with subsidy under the discounted reward criterion
satisfies the value-boundedness condition. More specifically, we have

$$|V_{\beta,m}(\omega) - V_{\beta,m}(\omega')| \le c + 1, \quad \text{for all } \omega, \omega' \in [0, 1], \tag{40}$$

where c = maxj , ^ , — |. 

L I-Pll ' Poi J 

Proof: See Appendix D. ■ 
Under the value boundedness condition, the optimal policy for the single-armed bandit process
with subsidy under the average reward criterion can be obtained from the limit of any pointwise
convergent series of the optimal policies under the discounted reward criterion. The following
lemma shows that the optimal policy for the single-armed bandit process with subsidy under
the average reward criterion is also a threshold policy.

Lemma 7: Let $\omega^*_\beta(m)$ denote the threshold of the optimal policy for the single-armed bandit
process with subsidy $m$ under the discounted reward criterion. Then $\lim_{\beta \to 1} \omega^*_\beta(m)$ exists for any
$m$. Furthermore, the optimal policy for the single-armed bandit process with subsidy $m$ under
the average reward criterion is also a threshold policy with threshold $\omega^*(m) = \lim_{\beta \to 1} \omega^*_\beta(m)$.

Here we do not consider the trivial case that the arm has absorbing states.

Proof: See Appendix E. ■

B. Indexability and Whittle's Index Policy

Based on Lemma 7, the restless multi-armed bandit process $(\Omega(1), \{P_i\}_{i=1}^N, \{B_i\}_{i=1}^N, 1)$ is
indexable if the threshold $\omega^*(m)$ of the optimal policy is monotonically increasing with the subsidy
$m$. Next, we show that the monotonicity holds and the restless multi-armed bandit process
$(\Omega(1), \{P_i\}_{i=1}^N, \{B_i\}_{i=1}^N, 1)$ is indexable. Moreover, we obtain Whittle's index in closed form as
shown below.



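Theorem 4: The restless multi-armed bandit process $(\Omega(1), \{P_i\}_{i=1}^N, \{B_i\}_{i=1}^N, 1)$ is indexable
with Whittle's index $W(\omega)$ given below.

• Case 1: Positively correlated channel ($p_{11}^{(i)} > p_{01}^{(i)}$).

$$W(\omega) = \begin{cases}
\omega B_i, & \text{if } \omega \le p_{01}^{(i)} \text{ or } \omega \ge p_{11}^{(i)} \\[1ex]
\dfrac{(\omega - T^1(\omega))\big(L(p_{01}^{(i)}, \omega) + 1\big) + T^{L(p_{01}^{(i)},\omega)}(p_{01}^{(i)})}{1 - p_{11}^{(i)} + (\omega - T^1(\omega))\, L(p_{01}^{(i)}, \omega) + T^{L(p_{01}^{(i)},\omega)}(p_{01}^{(i)})}\, B_i, & \text{if } p_{01}^{(i)} < \omega < \omega_o^{(i)} \\[2ex]
\dfrac{\omega}{1 - p_{11}^{(i)} + \omega}\, B_i, & \text{if } \omega_o^{(i)} \le \omega < p_{11}^{(i)}
\end{cases} \tag{41}$$

• Case 2: Negatively correlated channel ($p_{11}^{(i)} < p_{01}^{(i)}$).

$$W(\omega) = \begin{cases}
\omega B_i, & \text{if } \omega \le p_{11}^{(i)} \text{ or } \omega \ge p_{01}^{(i)} \\[1ex]
\dfrac{p_{01}^{(i)} + \omega - T^1(\omega)}{1 + p_{01}^{(i)} - T^1(p_{11}^{(i)}) + T^1(\omega) - \omega}\, B_i, & \text{if } p_{11}^{(i)} < \omega < \omega_o^{(i)} \\[2ex]
\dfrac{p_{01}^{(i)}}{1 + p_{01}^{(i)} - T^1(p_{11}^{(i)})}\, B_i, & \text{if } \omega_o^{(i)} \le \omega < T^1(p_{11}^{(i)}) \\[2ex]
\dfrac{p_{01}^{(i)}}{1 + p_{01}^{(i)} - \omega}\, B_i, & \text{if } T^1(p_{11}^{(i)}) \le \omega < p_{01}^{(i)}
\end{cases} \tag{42}$$

Proof: See Appendix F. ■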



The monotonicity and piecewise concave/convex properties of Whittle's index under the
discounted reward criterion given in Corollary 1 are preserved under the average reward criterion.
The only difference is that Whittle's index under the discounted reward criterion is always strictly
increasing with the belief state while Whittle's index $W(\omega)$ under the average reward criterion is
a constant function of $\omega$ when $\omega_o \le \omega < T^1(p_{11})$ for a negatively correlated channel (see (42)).
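The average-reward index can be evaluated in the same way as the discounted one. The sketch below is illustrative only: the function name is hypothetical, the expressions are the $\beta \to 1$ limits of the discounted index (the numerator of the middle positively correlated case is derived from that limit), the crossing time $L(p_{01}, \omega)$ is found by iterating $T^1(\omega) = \omega p_{11} + (1-\omega) p_{01}$ with an assumed cap `max_steps`, and arms with absorbing states are excluded.

```python
def whittle_index_average(w, p11, p01, B=1.0, max_steps=100_000):
    """Whittle's index of belief w under the average reward criterion (sketch)."""
    T = lambda x: x * p11 + (1.0 - x) * p01      # one-step belief update
    w_o = p01 / (1.0 + p01 - p11)                # stationary belief; assumes no absorbing state
    if p11 >= p01:                               # positively correlated channel
        if w <= p01 or w >= p11:
            return w * B
        if w >= w_o:
            return w / (1.0 - p11 + w) * B
        # p01 < w < w_o: L = L(p01, w), t = T^L(p01)
        L, t = 0, p01
        while t <= w and L < max_steps:
            t, L = T(t), L + 1
        x = w - T(w)
        return (x * (L + 1) + t) / (1.0 - p11 + x * L + t) * B
    # negatively correlated channel
    t1 = T(p11)
    if w <= p11 or w >= p01:
        return w * B
    if w >= t1:
        return p01 / (1.0 + p01 - w) * B
    if w >= w_o:
        return p01 / (1.0 + p01 - t1) * B        # constant on [w_o, T(p11))
    return (p01 + w - T(w)) / (1.0 + p01 - t1 + T(w) - w) * B
```

The flat segment on $[\omega_o, T^1(p_{11}))$ for a negatively correlated channel is the one place where this index is not strictly increasing, in line with the remark above.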

C. The Performance of Whittle's Index Policy

Similar to the case under the discounted reward criterion, Whittle's index policy is optimal
under the average reward criterion when the constraint on the number of activated arms $K(t)$
($t \ge 1$) is relaxed to the following.

$$E_\pi\Big[\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} K(t)\Big] = K.$$

Let $\bar{J}(\Omega(1))$ denote the maximum expected average reward that can be obtained under this
relaxed constraint when the initial belief vector is $\Omega(1)$. Based on the Lagrangian multiplier
theorem, we have [3]

$$\bar{J} = \inf_m \Big\{ \sum_{i=1}^N J_m^{(i)} - m(N - K) \Big\}, \tag{43}$$

where $J_m^{(i)}$ is the value function of the single-armed bandit process with subsidy $m$ that
corresponds to the $i$-th channel.

Let $J(\Omega(1))$ denote the maximum expected average reward of the RMBP under the strict
constraint that $K(t) = K$ for all $t$. Obviously,

$$J(\Omega(1)) \le \bar{J}.$$

$\bar{J}$ thus provides a performance benchmark for Whittle's index policy under the strict constraint.
To evaluate $\bar{J}$, we consider the single-armed bandit with subsidy $m$ under the average reward
criterion. The value function $J_m$ and the average passive time $D_m = \frac{dJ_m}{dm}$ can be obtained in
closed form as shown in Lemma 8 below.

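Lemma 8: The value function $J_m$ and $D_m$ can be obtained in closed form as given below,
where $\omega^*(m)$ is the threshold of the optimal policy. Furthermore, $D_m$ is piecewise constant and
increasing with $m$.

$$J_m = \begin{cases}
\omega_o, & \text{if } \omega^*(m) < \min\{p_{01}, p_{11}\} \\[1ex]
\dfrac{(1-p_{11})\, L(p_{01}, \omega^*(m))\, m + T^{L(p_{01},\omega^*(m))}(p_{01})}{(1-p_{11})\big(L(p_{01}, \omega^*(m)) + 1\big) + T^{L(p_{01},\omega^*(m))}(p_{01})}, & \text{if } p_{01} \le \omega^*(m) < \omega_o \\[2ex]
\dfrac{p_{01}\, m + p_{01}}{1 + 2 p_{01} - T^1(p_{11})}, & \text{if } p_{11} \le \omega^*(m) < T^1(p_{11}) \\[2ex]
m, & \text{other cases}
\end{cases} \tag{44}$$

and

$$D_m = \begin{cases}
0, & \text{if } \omega^*(m) < \min\{p_{01}, p_{11}\} \\[1ex]
\dfrac{(1-p_{11})\, L(p_{01}, \omega^*(m))}{(1-p_{11})\big(L(p_{01}, \omega^*(m)) + 1\big) + T^{L(p_{01},\omega^*(m))}(p_{01})}, & \text{if } p_{01} \le \omega^*(m) < \omega_o \\[2ex]
\dfrac{p_{01}}{1 + 2 p_{01} - T^1(p_{11})}, & \text{if } p_{11} \le \omega^*(m) < T^1(p_{11}) \\[2ex]
1, & \text{other cases}
\end{cases} \tag{45}$$

Proof: Under the value-boundedness condition as shown in Sec. V-A, we have, according
to Dutta's theorem,

$$J_m = \lim_{\beta \to 1} (1-\beta)\, V_{\beta,m}(\omega),$$

which leads to (44) directly. The closed-form expression for $D_m$ can be obtained from the
derivative of $J_m$ with respect to $m$. The proof that $D_m$ is increasing with $m$ is similar to that
given in Lemma 5. ■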


Based on the closed-form $D_m$ given in Lemma 8, we can obtain the subsidy $m^*$ that achieves
the infimum in (43). Specifically, the subsidy $m^*$ that achieves the infimum in (43) is the
supremum value of $m \in [0, 1]$ satisfying $\sum_{i=1}^N D_m^{(i)} < N - K$. After obtaining $m^*$, it is easy to
calculate the infimum according to the closed-form $J_m$ given in Lemma 8. With minor changes,
the algorithm in Sec. IV-F can be applied to evaluate the upper bound $\bar{J}$. We notice that the
initial belief is not considered in the algorithm, which leads to a shorter running time.

Simulation results similar to Fig. 9 have been observed, demonstrating the near-optimal
performance of Whittle's index policy under the average reward criterion.

VI. Whittle's Index Policy for Stochastically Identical Channels

Based on the equivalency between Whittle's index policy and the myopic policy for stochas- 
tically identical arms, we can analyze Whittle's index policy by focusing on the myopic policy 
which has a much simpler index form. In this section, we establish the semi-universal structure 
and study the optimality of Whittle's index policy for stochastically identical arms. 

A. The Structure of Whittle's Index Policy

The implementation of Whittle's index policy can be described with a queue structure.
Specifically, all $N$ channels are ordered in a queue, and in each slot, those $K$ channels at the head of
the queue are sensed. Based on the observations, channels are reordered at the end of each slot
according to the following simple rules.

When $p_{11} > p_{01}$, the channels observed in state 1 will stay at the head of the queue while the
channels observed in state 0 will be moved to the end of the queue (see Fig. 14).

When $p_{11} < p_{01}$, the channels observed in state 0 will stay at the head of the queue while the
channels observed in state 1 will be moved to the end of the queue. The order of the unobserved
channels is reversed (see Fig. 15).
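The two reordering rules translate into a few lines of code. The sketch below is illustrative: the function name and the data layout (a list of channel labels plus a dictionary of observed states) are assumptions, and the relative order kept within the groups of channels observed in the same state is an arbitrary choice, since those channels share the same belief.

```python
def reorder_queue(order, K, observations, positively_correlated):
    """Return the channel queue for the next slot.

    order: channel labels from head to tail; the first K are sensed.
    observations: dict mapping each sensed channel to its observed state (0 or 1).
    """
    sensed, unobserved = order[:K], order[K:]
    good = [c for c in sensed if observations[c] == 1]
    bad = [c for c in sensed if observations[c] == 0]
    if positively_correlated:
        # State-1 channels stay at the head; state-0 channels go to the end.
        return good + unobserved + bad
    # State-0 channels stay at the head, unobserved channels are reversed,
    # and state-1 channels go to the end.
    return bad + unobserved[::-1] + good
```

For instance, with queue `[1, 2, 3, 4, 5]`, $K = 3$, and observations 1, 0, 1 on channels 1, 2, 3, the next queue is `[1, 3, 4, 5, 2]` in the positively correlated case and `[2, 5, 4, 1, 3]` in the negatively correlated case. No transition probabilities enter the update beyond the order of $p_{11}$ and $p_{01}$.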



Fig. 14. The structure of Whittle's index policy ($p_{11} > p_{01}$).

Fig. 15. The structure of Whittle's index policy ($p_{11} < p_{01}$).

The initial channel ordering $\mathcal{K}(1)$ is determined by the initial belief vector as given below.

$$\omega_{n_1}(1) \ge \omega_{n_2}(1) \ge \cdots \ge \omega_{n_N}(1) \implies \mathcal{K}(1) = (n_1, n_2, \ldots, n_N).$$

See Appendix G for the proof of the structure of Whittle's index policy.

The advantage of this structure of Whittle's index policy is twofold. First, it demonstrates
the simplicity of Whittle's index policy: channel selection is reduced to maintaining a simple
queue structure that requires no computation and little memory. Second, it shows that Whittle's
index policy has a semi-universal structure; it can be implemented without knowing the channel
transition probabilities except the order of $p_{11}$ and $p_{01}$. As a result, Whittle's index policy is
robust against model mismatch and automatically tracks variations in the channel model provided
that the order of $p_{11}$ and $p_{01}$ remains unchanged. As shown in Fig. 16, the transition probabilities
change abruptly in the sixth slot, which corresponds to an increase in the occurrence of the good
channel state in the system. From the change in the rate at which the throughput increases in this
figure, we can observe that Whittle's index policy effectively tracks the model variations.

Fig. 16. Tracking the change in channel transition probabilities occurring at $t = 6$ ($p_{11} = 0.6$, $p_{01} = 0.1$ for $T \le 5$; $p_{11} = 0.9$, $p_{01} = 0.4$ for $T > 5$; horizontal axis: time slot $T$).



B. Optimality and Approximation Factor of Whittle's Index Policy

Based on the simple structure of Whittle's index policy for stochastically identical channels,
we can obtain a lower bound on its performance. Combining this lower bound and the upper
bound shown in Sec. V-C, we further obtain the approximation factor of the performance of
Whittle's index policy, which is independent of the channel parameters. Recall that $J$ denotes the
average reward achieved by the optimal policy. Let $J_w$ denote the average reward achieved by
Whittle's index policy.

Theorem 5: Lower and Upper Bounds of the Performance of Whittle's Index Policy

$$\frac{K\, T^{\lfloor N/K \rfloor - 1}(p_{01})}{1 - p_{11} + T^{\lfloor N/K \rfloor - 1}(p_{01})} \le J_w \le J \le \min\Big\{ \frac{K \omega_o}{1 - p_{11} + \omega_o},\ \omega_o N \Big\}, \quad \text{if } p_{11} > p_{01} \tag{47}$$

$$\frac{K\, p_{01}}{1 - T^{2\lfloor N/K \rfloor - 2}(p_{11}) + p_{01}} \le J_w \le J \le \min\Big\{ \frac{K\, p_{01}}{1 - T^1(p_{11}) + p_{01}},\ \omega_o N \Big\}, \quad \text{if } p_{11} < p_{01} \tag{48}$$

Proof: The upper bound of $J$ is obtained from the upper bound of the optimal performance
for generally non-identical channels as given in (43). The lower bound of $J_w$ is obtained from
the structure of Whittle's index policy. See Appendix H for the complete proof. ■

Corollary 2: Let $\eta = J_w / J$ be the approximation factor defined as the ratio of the performance
of Whittle's index policy to the optimal performance. We have

Positively correlated channels: $\eta = 1$ for $K = 1, N-1, N$; $\eta \ge \frac{K}{N}$ otherwise.

Negatively correlated channels: $\eta = 1$ for $K = N-1, N$; $\eta \ge \max\{\frac{1}{2}, \frac{K}{N}\}$ otherwise.

Proof: See Appendix I. ■

Fig. 17. The approximation factor of Whittle's index policy.
Fig. 17 illustrates the approximation factors of Whittle's index policy for both positively
correlated and negatively correlated channels. We notice that the approximation factor approaches
1 as $K$ increases. For negatively correlated channels, Whittle's index policy achieves at least
half the optimal performance. For positively correlated channels, the approximation factor can
be further improved under certain conditions on the transition probabilities.

From Corollary 2, Whittle's index policy is optimal when $K = 1$ (for positively correlated channels)
and $K = N-1$. The optimality for $K = N$ is trivial. We point out that for a general $K$, numerical
examples have shown that actions given by Whittle's index policy match the optimal actions
for randomly generated sample paths, suggesting the optimality of Whittle's index policy.

VII. Conclusion 

In this paper, we have formulated the multi-channel opportunistic access problem as a restless
multi-armed bandit process. We established the indexability and obtained Whittle's index in
closed form under both discounted and average reward criteria. We developed efficient algorithms
for computing an upper bound on the performance of the optimal policy, which is the optimal
performance under the relaxed constraint on the average number of channels that can be sensed
simultaneously. When channels are stochastically identical, we have shown that Whittle's index
policy coincides with the myopic policy. Based on this equivalency, we have established the
semi-universal structure and the optimality of Whittle's index policy under certain conditions.

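Appendix A: Proof of Lemma 3

From (22), we have

$$V_{\beta,m}(p_{01}) = \frac{1-\beta^{L(p_{01},\omega^*_\beta(m))}}{1-\beta}\, m + \beta^{L(p_{01},\omega^*_\beta(m))}\, V_{\beta,m}\big(T^{L(p_{01},\omega^*_\beta(m))}(p_{01});\, u=1\big), \qquad (49)$$

$$V_{\beta,m}(p_{11}) = \frac{1-\beta^{L(p_{11},\omega^*_\beta(m))}}{1-\beta}\, m + \beta^{L(p_{11},\omega^*_\beta(m))}\, V_{\beta,m}\big(T^{L(p_{11},\omega^*_\beta(m))}(p_{11});\, u=1\big). \qquad (50)$$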



As shown in the main text, $V_{\beta,m}\big(T^{L(\omega,\omega^*_\beta(m))}(\omega);\, u=1\big)$ is a function of $V_{\beta,m}(p_{01})$ and $V_{\beta,m}(p_{11})$ for any
$\omega \in [0, 1]$. We thus have two equations (49) and (50) for two unknowns $V_{\beta,m}(p_{01})$ and $V_{\beta,m}(p_{11})$,
provided that we can obtain the two crossing times $L(p_{01},\omega^*_\beta(m))$ and $L(p_{11},\omega^*_\beta(m))$.

From (16) and (17), we can obtain these crossing times by considering different regions that
the threshold $\omega^*_\beta(m)$ may lie in (see Fig. 18 and Fig. 19). We can thus solve for $V_{\beta,m}(p_{01})$
and $V_{\beta,m}(p_{11})$ from (49) and (50) by considering each region within which both crossing times
$L(p_{01},\omega^*_\beta(m))$ and $L(p_{11},\omega^*_\beta(m))$ are constant.
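The crossing times used above can also be evaluated numerically by iterating the one-step belief update $T^1(\omega) = p_{01} + \omega(p_{11} - p_{01})$. The sketch below assumes that the crossing time $L(\omega, \omega')$ is the smallest $k \ge 0$ with $T^k(\omega) > \omega'$ (strict inequality) and that it is infinite when the threshold is never exceeded; the function names and the iteration cap `max_steps` are hypothetical.

```python
import math


def belief_update(omega, p01, p11):
    """One-step belief update of an unobserved channel: T^1(w) = p01 + w (p11 - p01)."""
    return p01 + omega * (p11 - p01)


def crossing_time(omega, threshold, p01, p11, max_steps=10_000):
    """Smallest k >= 0 such that T^k(omega) > threshold.

    Assumption: a strict inequality defines the crossing. If the belief has not
    exceeded the threshold after max_steps updates, math.inf is returned.
    """
    belief = omega
    for k in range(max_steps + 1):
        if belief > threshold:
            return k
        belief = belief_update(belief, p01, p11)
    return math.inf


if __name__ == "__main__":
    p01, p11 = 0.2, 0.8            # positively correlated; stationary belief is 0.5
    print(crossing_time(p01, 0.4, p01, p11))
    print(crossing_time(p11, 0.4, p01, p11))
    print(crossing_time(p01, 0.6, p01, p11))
```

With these parameters the belief started at $p_{01}$ rises monotonically toward the stationary value 0.5, so a threshold above 0.5 is never crossed, which corresponds to the regions labeled $L = \infty$ in Fig. 18.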



Fig. 18. The threshold crossing time for different regions of $\omega^*_\beta(m)$ when $p_{11} > p_{01}$ (the top partition is for $L(p_{01},\omega^*_\beta(m))$,
the bottom for $L(p_{11},\omega^*_\beta(m))$).



Fig. 19. The threshold crossing time for different regions of $\omega^*_\beta(m)$ when $p_{11} < p_{01}$ (the top partition is for $L(p_{11},\omega^*_\beta(m))$,
the bottom for $L(p_{01},\omega^*_\beta(m))$).

Appendix B: Proof of Theorem 1

It suffices to prove that an arm with an arbitrary transition matrix P is indexable. Based on
the threshold structure of the optimal policy for the single-armed bandit with subsidy m given
in Lemma 2, indexability is reduced to the monotonicity of the threshold $\omega^*_\beta(m)$, i.e., $\omega^*_\beta(m)$
is monotonically increasing with the subsidy $m$ for $m \in [0, 1)$. To prove the monotonicity of
$\omega^*_\beta(m)$, we first give the following lemma.

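Lemma 9: Suppose that for any $m \in [0, 1)$ we have

$$\left.\frac{d\, V_{\beta,m}(\omega;\, u=1)}{dm}\right|_{\omega=\omega^*_\beta(m)} < \left.\frac{d\, V_{\beta,m}(\omega;\, u=0)}{dm}\right|_{\omega=\omega^*_\beta(m)}. \qquad (51)$$

Then $\omega^*_\beta(m)$ is monotonically increasing with $m$.

We prove Lemma 9 by contradiction. Assume that there exists an $m_0 \in [0, 1)$ such that $\omega^*_\beta(m)$
is decreasing at $m_0$. Then, there exists an $\epsilon > 0$ such that for any $\Delta m \in [0, \epsilon]$, we have

$$V_{\beta,m_0+\Delta m}(\omega^*_\beta(m_0);\, u=1) \ge V_{\beta,m_0+\Delta m}(\omega^*_\beta(m_0);\, u=0). \qquad (52)$$

Since $\omega^*_\beta(m_0)$ is the threshold of the optimal policy under subsidy $m_0$, we have

$$V_{\beta,m_0}(\omega^*_\beta(m_0);\, u=1) = V_{\beta,m_0}(\omega^*_\beta(m_0);\, u=0). \qquad (53)$$

From (52) and (53), we have

$$\left.\frac{d\, V_{\beta,m}(\omega;\, u=1)}{dm}\right|_{\omega=\omega^*_\beta(m_0)} \ge \left.\frac{d\, V_{\beta,m}(\omega;\, u=0)}{dm}\right|_{\omega=\omega^*_\beta(m_0)},$$

which contradicts (51). Lemma 9 thus holds.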

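According to Lemma 9, it is sufficient to prove (51). Recall that $D_{\beta,m}(\omega) = \frac{d\, V_{\beta,m}(\omega)}{dm}$. From the
expressions for $V_{\beta,m}(\omega;\, u=1)$ and $V_{\beta,m}(\omega;\, u=0)$, we can write (51) as

$$\beta\big(\omega^*_\beta(m)\, D_{\beta,m}(p_{11}) + (1-\omega^*_\beta(m))\, D_{\beta,m}(p_{01})\big) < 1 + \beta\, D_{\beta,m}\big(T^1(\omega^*_\beta(m))\big). \qquad (54)$$

To prove (54), we consider the following three regions of $\omega^*_\beta(m)$.

Region 1: $0 \le \omega^*_\beta(m) < \min\{p_{01}, p_{11}\}$.

Based on the lower bound of the updated belief given in Lemma 1, the arm will be activated
in every slot when the initial belief $\omega > \omega^*_\beta(m)$. Thus, $D_{\beta,m}(p_{11}) = D_{\beta,m}(p_{01}) =
D_{\beta,m}(T^1(\omega^*_\beta(m))) = 0$; (54) holds trivially.

Region 2: $\omega_o \le \omega^*_\beta(m) \le 1$.

In this region, the arm is made passive in every slot when the initial belief state is $T^1(\omega^*_\beta(m))$.
This is because $T^k(\omega^*_\beta(m)) \le \omega^*_\beta(m)$ for any $k \ge 1$ (see Lemma 1, Fig. 3 and Fig. 4). Therefore,
$D_{\beta,m}(T^1(\omega^*_\beta(m))) = \frac{1}{1-\beta}$. Since both $D_{\beta,m}(p_{11})$ and $D_{\beta,m}(p_{01})$ are upper bounded by $\frac{1}{1-\beta}$, the
left-hand side of (54) is at most $\frac{\beta}{1-\beta}$ while the right-hand side equals $1 + \frac{\beta}{1-\beta}$, so (54) holds.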

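Region 3: $\min\{p_{01}, p_{11}\} \le \omega^*_\beta(m) < \omega_o$.

In this region, $T^1(\omega^*_\beta(m)) \ge \omega^*_\beta(m)$ (see Fig. 3 and Fig. 4). Thus, $T^1(\omega^*_\beta(m))$ is in the active
set, which gives us

$$D_{\beta,m}\big(T^1(\omega^*_\beta(m))\big) = \beta\Big(T^1(\omega^*_\beta(m))\, D_{\beta,m}(p_{11}) + \big(1 - T^1(\omega^*_\beta(m))\big)\, D_{\beta,m}(p_{01})\Big). \qquad (55)$$

To prove (54), we consider the positively correlated and negatively correlated cases separately.

• Case 1: Negatively correlated channel ($p_{11} < p_{01}$).

Since $p_{01} > \omega_o > \omega^*_\beta(m)$, $p_{01}$ is in the active set. We thus have

$$D_{\beta,m}(p_{01}) = \beta\big(p_{01}\, D_{\beta,m}(p_{11}) + (1 - p_{01})\, D_{\beta,m}(p_{01})\big). \qquad (56)$$

Substituting (55) and (56) into (54), we reduce (54) to the following inequality.

$$\frac{\beta\, D_{\beta,m}(p_{11})\,(1-\beta)\,\big(\beta p_{01} + \omega^*_\beta(m) - \beta T^1(\omega^*_\beta(m))\big)}{1 - \beta + \beta p_{01}} < 1. \qquad (57)$$

Notice that the left-hand side of (57) is increasing with $\omega^*_\beta(m)$ and $D_{\beta,m}(p_{11})$. It thus suffices to
show the inequality by replacing $\omega^*_\beta(m)$ with its upper bound $\omega_o$ and $D_{\beta,m}(p_{11})$ with its upper
bound $\frac{1}{1-\beta}$. After some simplifications, it is sufficient to prove

$$f(\beta) = p_{01}(p_{01} - p_{11})\beta^2 + \beta\,(p_{01} + 1 - p_{11} - p_{01}^2 + p_{01}p_{11}) - 1 - p_{01} + p_{11} \le 0. \qquad (58)$$

It is easy to see that $f(\beta)$ is convex in $\beta$, $f(0) = -1 - p_{01} + p_{11} < 0$, and $f(1) = 0$. We thus
conclude that $f(\beta) \le 0$ for any $0 \le \beta \le 1$.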
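Inequality (58) can be spot-checked numerically. The sketch below evaluates the quadratic $f(\beta)$ of (58) on a grid of negatively correlated parameter pairs ($p_{11} < p_{01}$) and reports the largest value found; it is a sanity check under floating-point arithmetic, not a substitute for the convexity argument, and the function names and grid resolution are hypothetical.

```python
def f_case1(beta, p01, p11):
    """The quadratic f(beta) appearing in (58)."""
    return (p01 * (p01 - p11) * beta ** 2
            + beta * (p01 + 1 - p11 - p01 ** 2 + p01 * p11)
            - 1 - p01 + p11)


def max_f_on_grid(steps=50):
    """Largest value of f over a grid with p11 < p01 and 0 <= beta <= 1."""
    worst = float("-inf")
    for i in range(steps + 1):
        for j in range(steps + 1):
            p01, p11 = i / steps, j / steps
            if p11 >= p01:          # keep only negatively correlated channels
                continue
            for b in range(steps + 1):
                worst = max(worst, f_case1(b / steps, p01, p11))
    return worst


if __name__ == "__main__":
    # The maximum is expected to be numerically zero, attained at beta = 1.
    print(max_f_on_grid())
```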

• Case 2: Positively correlated channel ($p_{11} > p_{01}$).

Since $p_{11} > \omega_o > \omega^*_\beta(m)$, $p_{11}$ is in the active set. We thus have

$$D_{\beta,m}(p_{11}) = \beta\big(p_{11}\, D_{\beta,m}(p_{11}) + (1 - p_{11})\, D_{\beta,m}(p_{01})\big). \qquad (59)$$

Substituting (55) and (59) into (54), we reduce (54) to the following inequality.

$$\beta\, D_{\beta,m}(p_{01})\,(1-\beta)\left(1 - \frac{\omega^*_\beta(m) - \beta T^1(\omega^*_\beta(m))}{1 - \beta p_{11}}\right) < 1. \qquad (60)$$

Substituting the closed-form of $D_{\beta,m}(p_{01})$ given in (28) into (60), we end up with an inequality
in terms of $L(p_{01},\omega^*_\beta(m))$ and $\omega^*_\beta(m)$. Notice that the left-hand side of (60) is decreasing
with $\omega^*_\beta(m)$. It thus suffices to show the inequality by replacing $\omega^*_\beta(m)$ with its lower bound
$T^{L(p_{01},\omega^*_\beta(m))-1}(p_{01})$ (by the definition of $L(p_{01},\omega^*_\beta(m))$). Let $x = p_{11} - p_{01}$. After some
simplifications, it is sufficient to show that for any $0 \le \beta \le 1$, $0 \le p_{01} \le 1$, $0 \le x \le 1 - p_{01}$,
$L \in \{0, 1, 2, \ldots\}$,

$$f(\beta) = \beta^{L+2} p_{01} x^{L+1}(1-x) + \beta^2\,(p_{01}x^{L+2} + x - x^2 - p_{01}x) + \beta\,(x^2 + p_{01}x - p_{01}x^{L+1} - 1) + 1 - x \ge 0. \qquad (61)$$

Since $f(0) = 1 - x \ge 0$ and $f(1) = 0$, it is sufficient to prove that $f(\beta)$ is strictly decreasing
with $\beta$ for $0 \le \beta \le 1$, which follows by showing $\frac{d f(\beta)}{d\beta} < 0$ for $0 \le \beta < 1$, where

$$\frac{d f(\beta)}{d\beta} = (L+2)\beta^{L+1} p_{01} x^{L+1}(1-x) + 2\beta\,(p_{01}x^{L+2} + x - x^2 - p_{01}x) + (x^2 + p_{01}x - p_{01}x^{L+1} - 1). \qquad (62)$$

To show $\frac{d f(\beta)}{d\beta} < 0$ for $0 \le \beta < 1$, we will establish the following two facts:

(i) $\left.\frac{d f(\beta)}{d\beta}\right|_{\beta=1} \le 0$.

(ii) $\frac{d f(\beta)}{d\beta}$ is strictly increasing with $\beta$.

To prove (i), we set $\beta = 1$ in (62). After some simplifications, we need to prove

$$h(p_{01}) = -p_{01} L x^{L+2} + p_{01}(L+1)x^{L+1} - x^2 - p_{01}x + 2x - 1 \le 0. \qquad (63)$$

Since $h(0) = -(x-1)^2 \le 0$, it is sufficient to prove that $h(p_{01})$ is monotonically decreasing
with $p_{01}$, i.e., we need to prove

$$\frac{d\, h(p_{01})}{d\, p_{01}} = -L x^{L+2} + (L+1)x^{L+1} - x \le 0. \qquad (64)$$

Since $L x^{L+1} \le \sum_{k=1}^{L} x^k = \frac{x - x^{L+1}}{1-x}$, (64) holds. We thus proved (i).

To prove (ii), it suffices to show that the coefficient of $\beta$ in (62) is nonnegative, i.e., we need
to prove

$$p_{01}x^{L+2} + x - x^2 - p_{01}x \ge 0. \qquad (65)$$

Since $0 \le x \le 1 - p_{01}$, we have $p_{01}x(x^{L+1} - 1) \ge -p_{01}x \ge (x-1)x$. It is easy to see that (65)
holds. We thus proved (ii).

From (i) and (ii), it is easy to see that $\frac{d f(\beta)}{d\beta} < 0$ for any $0 \le \beta < 1$. We thus proved the
indexability.
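The sign claim for the derivative in (62) can be spot-checked on a grid. The sketch below assumes the reading $\frac{df}{d\beta} = (L+2)\beta^{L+1}p_{01}x^{L+1}(1-x) + 2\beta(p_{01}x^{L+2} + x - x^2 - p_{01}x) + (x^2 + p_{01}x - p_{01}x^{L+1} - 1)$, with $x = p_{11} - p_{01}$; the exponents are reconstructed from (63) and (65), so this reading is itself an assumption. The grid is restricted to $0 < p_{01} < 1$, $0 < x \le 1 - p_{01}$ and $0 \le \beta < 1$, and the function names are hypothetical.

```python
def df_dbeta(beta, p01, x, L):
    """Derivative of f with respect to beta, under the assumed reading of (62)."""
    return ((L + 2) * beta ** (L + 1) * p01 * x ** (L + 1) * (1 - x)
            + 2 * beta * (p01 * x ** (L + 2) + x - x ** 2 - p01 * x)
            + (x ** 2 + p01 * x - p01 * x ** (L + 1) - 1))


def max_derivative_on_grid(steps=20, max_L=10):
    """Largest derivative value over the grid; a negative result supports the claim."""
    worst = float("-inf")
    for i in range(1, steps):                  # p01 in (0, 1)
        p01 = i / steps
        for j in range(1, steps - i + 1):      # x in (0, 1 - p01]
            x = j / steps
            for b in range(steps):             # beta in [0, 1)
                beta = b / steps
                for L in range(max_L + 1):
                    worst = max(worst, df_dbeta(beta, p01, x, L))
    return worst


if __name__ == "__main__":
    print(max_derivative_on_grid())
```

A negative maximum over the grid is consistent with facts (i) and (ii); it does not cover the boundary cases excluded from the grid.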






Appendix C: Proof of Theorem 3

We notice that Step 1 runs in $O(N)$ time. In Step 2, the number of regions that needs to be
calculated for each channel is at most $O(\log N)$. It runs in constant time to find
$l_i$ and $d_i$ for channel $i$. So Step 2 runs in at most $O(N \log N)$ time. In Step 3, the ordering
of all those probabilities needs at most $O(N \log N)\,\log(O(N \log N)) = O(N (\log N)^2)$ time.
Step 4 runs in $O(N)$ time for each region that does not belong to V. So Step 4 runs in at most
$O(N^2 \log N)$ time. Finally, Step 5 runs in $O(N)$ time. Overall, the algorithm runs in at most
$O(N^2 \log N)$ time.

Appendix D: Proof of Lemma 6

From the closed-form $V_{\beta,m}(p_{01})$ (see Lemma 3), we have, for any $\beta$ ($0 \le \beta < 1$),

$$|V_{\beta,m}(p_{01}) - V_{\beta,m}(p_{11})| \le c. \qquad (66)$$

From Fig. 6, Fig. [?], and Fig. 5, we have, for any $\omega \in [0, 1]$,

$$\min\{V_{\beta,m}(0;\, u=1),\, V_{\beta,m}(1;\, u=1)\} \le V_{\beta,m}(\omega) \le \max\{V_{\beta,m}(0;\, u=0),\, V_{\beta,m}(1;\, u=1)\}. \qquad (67)$$

Consequently, we have, for any $\omega, \omega' \in [0, 1]$,

$$|V_{\beta,m}(\omega) - V_{\beta,m}(\omega')| \le \max\{|V_{\beta,m}(0;u=1) - V_{\beta,m}(1;u=1)|,\ |V_{\beta,m}(0;u=0) - V_{\beta,m}(0;u=1)|,\ |V_{\beta,m}(0;u=0) - V_{\beta,m}(1;u=1)|\}$$
$$\le \max\{|\beta(V_{\beta,m}(p_{01}) - V_{\beta,m}(p_{11})) - 1|,\ |\beta(V_{\beta,m}(p_{01}) - V_{\beta,m}(p_{11}))|,\ 1\}.$$

Since $|V_{\beta,m}(p_{01}) - V_{\beta,m}(p_{11})| \le c$ for any $\beta$ ($0 \le \beta < 1$), we have $|V_{\beta,m}(\omega) - V_{\beta,m}(\omega')| \le c + 1$ for
any $\beta$ ($0 \le \beta < 1$) and $\omega, \omega' \in [0, 1]$. Thus the value-boundedness condition is satisfied.

Appendix E: Proof of Lemma 7

The convergence of $\omega^*_\beta(m)$ is trivial for $m < 0$ and $m \ge 1$.

For $0 \le m < 1$, let $W(\omega) = \lim_{\beta \to 1} W_\beta(\omega)$. This limit exists and is given in Theorem 4 (the
calculation of the limit is tedious and lengthy, and we skip the details). Define $\omega^*(m)$ as
the inverse function of $W(\omega)$. We notice that $W(\omega)$ is a constant function (thus not invertible)
when $\omega_o \le \omega \le T^1(p_{11})$ (see (42)). In this case, we set $\omega^*(m) = T^1(p_{11})$. Formally, we have

$$\omega^*(m) = \begin{cases} c \ (c < 0), & \text{if } m < 0 \\ \max\{\omega : W(\omega) = m\}, & \text{if } 0 \le m < 1 \\ b \ (b > 1), & \text{if } m \ge 1 \end{cases} \qquad (68)$$

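Next, we prove that $\lim_{\beta \to 1} \omega^*_\beta(m) = \omega^*(m)$ by contradiction. Since $W(\omega) =
\lim_{\beta \to 1} W_\beta(\omega)$ and $W_\beta(\omega)$ is increasing with $\omega$, $W(\omega)$ is also increasing with $\omega$. Assume first that
$W(\omega)$ is strictly increasing at point $\omega^*(m)$. We prove $\lim_{\beta \to 1} \omega^*_\beta(m) = \omega^*(m)$ by contradiction
as follows.

Assume $\omega^*_\beta(m) \nrightarrow \omega^*(m)$, i.e., there exist an $\epsilon > 0$, a $\beta'$ ($0 \le \beta' < 1$), and a sequence
$\{\beta_k\}$ ($\beta_k \to 1$) such that $|\omega^*_{\beta_k}(m) - \omega^*(m)| > \epsilon$ for any $\beta_k > \beta'$. If $\omega^*(m) - \epsilon > \omega^*_{\beta_k}(m)$
for any $\beta_k > \beta'$, then $W_{\beta_k}(\omega^*(m) - \epsilon) \ge W_{\beta_k}(\omega^*_{\beta_k}(m))$ for any $\beta_k > \beta'$ by the monotonicity
of $W_{\beta_k}(\omega)$. Since $W(\omega)$ is strictly increasing at point $\omega^*(m)$, there exists a $\delta > 0$ such that
$W(\omega^*(m)) - W(\omega^*(m) - \epsilon) \ge \delta$. Then we have, for any $\beta_k > \beta'$,

$$W_{\beta_k}(\omega^*(m) - \epsilon) \ge W_{\beta_k}(\omega^*_{\beta_k}(m)) = m = W(\omega^*(m)) \ge W(\omega^*(m) - \epsilon) + \delta,$$

which contradicts the fact that $W_{\beta_k}(\omega^*(m) - \epsilon) \to W(\omega^*(m) - \epsilon)$ as $\beta_k \to 1$. The proof
for the case when $\omega^*(m) + \epsilon < \omega^*_{\beta_k}(m)$ for any $\beta_k > \beta'$ is similar to the above.

Consider next that $W(\omega)$ is not strictly increasing at point $\omega^*(m)$. This case only occurs
when $p_{11} < p_{01}$ and $\omega^*(m) = T^1(p_{11})$. We notice that $W_\beta(T^1(p_{11}))$ increasingly converges to
$W(T^1(p_{11}))$ as $\beta \to 1$. Thus $\omega^*_\beta(m) \ge T^1(p_{11}) = \omega^*(m)$ by the monotonicity of $W_\beta(\omega)$. Assume
$\omega^*_\beta(m) \nrightarrow \omega^*(m)$, i.e., there exist an $\epsilon > 0$, a $\beta'$ ($0 \le \beta' < 1$) and a sequence $\{\beta_k\}$ ($\beta_k \to 1$) such
that $\omega^*_{\beta_k}(m) - \omega^*(m) > \epsilon$ for any $\beta_k > \beta'$. We have $W_{\beta_k}(\omega^*(m) + \epsilon) \le W_{\beta_k}(\omega^*_{\beta_k}(m))$ for any
$\beta_k > \beta'$ by the monotonicity of $W_{\beta_k}(\omega)$. Since $W(\omega)$ is strictly increasing in $[\omega^*(m), \omega^*(m) + \epsilon]$,
there exists a $\delta' > 0$ such that $W(\omega^*(m) + \epsilon) - W(\omega^*(m)) \ge \delta'$. Then we have, for any $\beta_k > \beta'$,

$$W_{\beta_k}(\omega^*(m) + \epsilon) \le W_{\beta_k}(\omega^*_{\beta_k}(m)) = m = W(\omega^*(m)) \le W(\omega^*(m) + \epsilon) - \delta',$$

which contradicts the fact that $W_{\beta_k}(\omega^*(m) + \epsilon) \to W(\omega^*(m) + \epsilon)$ as $\beta_k \to 1$.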

Next, we show that the optimal policy $\pi^*_{\beta_k}$ for the single-armed bandit process with subsidy
under the discounted reward criterion pointwise converges to a threshold policy $\pi^*$ as $\beta_k \to 1$.
To see this, we construct $\pi^*$ as follows: (1) If $m < 0$, then the arm is made active all the time; (2)
If $m \ge 1$, the arm is made passive all the time; (3) If $0 \le m < 1$, then the arm is made passive when
the current state $\omega \le \omega^*(m)$, otherwise it is activated. Since $\omega^*_\beta(m)$ converges to $\omega^*(m)$ as $\beta \to 1$,
it is easy to see that $\pi^*_{\beta_k}$ pointwise converges to $\pi^*$ for any $\beta_k \to 1$. Because the single-armed
bandit process with subsidy under the discounted reward criterion satisfies the value boundedness
condition (see Lemma 6), the threshold policy $\pi^*$ is optimal for the single-armed bandit process
with subsidy under the average reward criterion based on Dutta's theorem.

Appendix F: Proof of Theorem 4

Since $\omega^*(m) = \lim_{\beta \to 1} \omega^*_\beta(m)$ and $\omega^*_\beta(m)$ is monotonically increasing with $m$ (see Theorem 1),
it is easy to see that $\omega^*(m)$ is also monotonically increasing with $m$. Therefore, the bandit is
indexable.

Next, we prove that $W(\omega) = \lim_{\beta \to 1} W_\beta(\omega)$ is indeed Whittle's index under the average reward
criterion. For a belief state $\omega$ of an arm, its Whittle's index is the infimum subsidy $m$ such that
$\omega$ is in the passive set under the optimal policy for the arm, i.e., the infimum subsidy $m$ such
that $\omega \le \omega^*(m)$ (according to Lemma 7). From (68) and the monotonicity of $W(\omega)$ with $\omega$, we
have that $W(\omega)$ is the infimum subsidy $m$ such that $\omega \le \omega^*(m)$.

Appendix G: Proof of The Structure of Whittle's Index Policy 

The proof is an extension of the proof given in [10] under single-channel sensing ($K = 1$).
Consider the belief update of unobserved channels (see (1)).

$$T^1(\omega) = p_{01} + \omega(p_{11} - p_{01}). \qquad (69)$$

We notice that $T^1(\omega)$ is an increasing function of $\omega$ for $p_{11} > p_{01}$ and a decreasing function of
$\omega$ for $p_{11} < p_{01}$. Furthermore, the belief value $\omega_i(t)$ of channel $i$ in slot $t$ is bounded between
$p_{01}$ and $p_{11}$ for any $i$ and $t > 1$ (see (1)).

Consider first $p_{11} > p_{01}$. The channels observed to be in state 1 in slot $t-1$ will achieve
the upper bound $p_{11}$ of the belief value in slot $t$, while the channels observed to be in state 0
will achieve the lower bound $p_{01}$. Whittle's index policy, which is equivalent to the myopic policy, will stay
in channels observed to be in state 1 and recognize channels observed to be in state 0 as the
least favorite in the next slot. The unobserved channels maintain the ordering of belief values in
every slot due to the monotonically increasing property of $T^1(\omega)$. The structure of Whittle's index
policy for $p_{11} < p_{01}$ can be similarly obtained by noticing that reversing the order of unobserved
channels in every slot maintains the ordering of belief values due to the monotonically decreasing
property of $T^1(\omega)$.
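The ordering argument above can be illustrated with a short numerical example. The sketch below applies the belief update $T^1(\omega) = p_{01} + \omega(p_{11} - p_{01})$ of (69) to a few unobserved channels and ranks them by updated belief. The function names and the sample belief values are hypothetical and chosen only for illustration.

```python
def belief_update(omega, p01, p11):
    """One-step belief update of an unobserved channel, as in (69)."""
    return p01 + omega * (p11 - p01)


def updated_order(beliefs, p01, p11):
    """Channel indices sorted by updated belief, largest first."""
    updated = [belief_update(w, p01, p11) for w in beliefs]
    return sorted(range(len(beliefs)), key=lambda i: updated[i], reverse=True)


if __name__ == "__main__":
    beliefs = [0.3, 0.5, 0.4]
    # Positively correlated (p11 > p01): the ranking of the beliefs is preserved.
    print(updated_order(beliefs, p01=0.2, p11=0.8))
    # Negatively correlated (p11 < p01): the ranking of the beliefs is reversed.
    print(updated_order(beliefs, p01=0.8, p11=0.2))
```

Because only the ranking matters, a policy with this structure can be run by maintaining an ordered list of channels, without evaluating the beliefs themselves.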

Appendix H: Proof of Theorem 5

The proof for the lower bound of $J_w$ is an extension of that with single-channel sensing
($K = 1$) given in [10]. It is, however, much more complex to analyze the performance of
Whittle's index policy when $K > 1$. The lower bound obtained here is looser than that in [10]
when applied to the case of $K = 1$.

Define a transmission period on a channel as the number of consecutive slots in which the
channel has been sensed. Based on the structure of Whittle's index policy, it is easy to show that

$$\frac{J_w}{K} = \begin{cases} 1 - \frac{1}{E[\tau]}, & \text{if } p_{11} > p_{01} \\ \frac{1}{E[\tau]}, & \text{if } p_{11} < p_{01} \end{cases} \qquad (70)$$

where $E[\tau]$ is the average length of the transmission period over the infinite time horizon.

To bound the throughput $J_w$, it is equivalent to bound the average length of the transmission
period $E[\tau]$ as shown in (70). We consider the following two cases.

• Case 1: $p_{11} > p_{01}$

Let $\omega$ denote the belief value of the chosen channel in the first slot of a transmission period.
The length $\tau(\omega)$ of this transmission period has the following distribution.

$$\Pr[\tau(\omega) = l] = \begin{cases} 1 - \omega, & l = 1 \\ \omega\, p_{11}^{\,l-2}\, p_{10}, & l > 1 \end{cases} \qquad (71)$$

It is easy to see that if $\omega' \ge \omega$, then $\tau(\omega')$ stochastically dominates $\tau(\omega)$.

From the structure of Whittle's index policy, $\omega = T^k(p_{01})$, where $k$ is the number of consecutive
slots in which the channel has been unobserved since the last visit to this channel. When the user
leaves one channel, this channel has the lowest priority. It will take at least $\lfloor N/K \rfloor$ slots before
the user returns to the same channel, i.e., $k \ge \lfloor N/K \rfloor - 1$. Based on the monotonically increasing
property of the $k$-step transition probability $T^k(p_{01})$ (see Fig. 3), we have $\omega = T^k(p_{01}) \ge
T^{\lfloor N/K \rfloor - 1}(p_{01})$. Thus $\tau(T^{\lfloor N/K \rfloor - 1}(p_{01}))$ is stochastically dominated by $\tau(\omega)$, and the expectation of
the former leads to the lower bound of $J_w$ given in (47).

• Case 2: $p_{11} < p_{01}$

In this case, $\tau(\omega)$ has the following distribution:

$$\Pr[\tau(\omega) = l] = \begin{cases} \omega, & l = 1 \\ (1 - \omega)\, p_{00}^{\,l-2}\, p_{01}, & l > 1 \end{cases} \qquad (72)$$

Opposite to Case 1, $\tau(\omega')$ stochastically dominates $\tau(\omega)$ if $\omega' \le \omega$.

From the structure of Whittle's index policy, $\omega = T^k(p_{11})$, where $k$ is the number of
consecutive slots in which the channel has been unobserved since the last visit to this channel.
If $k$ is odd, then $T^k(p_{11}) \ge T^{2\lfloor N/K \rfloor - 2}(p_{11})$ since $2\lfloor N/K \rfloor - 2$ is an even number (see Fig. 4). If
$k$ is even, then $k$ is at least $2\lfloor N/K \rfloor - 2$, and we have $\omega = T^k(p_{11}) \ge T^{2\lfloor N/K \rfloor - 2}(p_{11})$. Thus $\tau(\omega)$ is
stochastically dominated by $\tau(T^{2\lfloor N/K \rfloor - 2}(p_{11}))$, and the expectation of the latter leads to the lower
bound of $J_w$ as given in (48).
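The expected length of a transmission period follows directly from the two distributions. Assuming (71) reads $\Pr[\tau = 1] = 1 - \omega$ and $\Pr[\tau = l] = \omega p_{11}^{l-2} p_{10}$ for $l > 1$, summing the series gives $E[\tau(\omega)] = 1 + \omega/(1 - p_{11})$; assuming (72) reads $\Pr[\tau = 1] = \omega$ and $\Pr[\tau = l] = (1 - \omega) p_{00}^{l-2} p_{01}$ for $l > 1$, it gives $E[\tau(\omega)] = 1 + (1 - \omega)/p_{01}$. The sketch below compares truncated sums against these closed forms; the function names and the truncation length `terms` are hypothetical.

```python
def expected_period_positive(omega, p11, terms=2000):
    """Truncated E[tau] for p11 > p01, using the distribution assumed for (71)."""
    p10 = 1.0 - p11
    total = 1.0 * (1.0 - omega)                 # l = 1: first sensed slot is in state 0
    for l in range(2, terms + 1):
        total += l * omega * p11 ** (l - 2) * p10
    return total


def expected_period_negative(omega, p01, terms=2000):
    """Truncated E[tau] for p11 < p01, using the distribution assumed for (72)."""
    p00 = 1.0 - p01
    total = 1.0 * omega                         # l = 1: first sensed slot is in state 1
    for l in range(2, terms + 1):
        total += l * (1.0 - omega) * p00 ** (l - 2) * p01
    return total


if __name__ == "__main__":
    omega = 0.4
    print(expected_period_positive(omega, p11=0.8), 1 + omega / (1 - 0.8))
    print(expected_period_negative(omega, p01=0.7), 1 + (1 - omega) / 0.7)
```

Substituting the first closed form into (70) gives a per-channel throughput of $\omega/(1 - p_{11} + \omega)$ when $p_{11} > p_{01}$, and the second gives $p_{01}/(1 - \omega + p_{01})$ when $p_{11} < p_{01}$.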

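Next, we show the upper bound of $J$. From (43), we have $J \le \min_m \{N J_m - m(N - K)\}$
since channels are stochastically identical.

When $p_{11} > p_{01}$, we have

$$J \le \min_m \{N J_m - m(N - K)\} = \min\left\{\frac{K\omega_o}{1 - p_{11} + \omega_o},\ N\omega_o\right\}. \qquad (73)$$

When $p_{11} < p_{01}$, we have

$$J \le \min_m \{N J_m - m(N - K)\} = \min\left\{\frac{K p_{01}}{1 - T^1(p_{11}) + p_{01}},\ N\omega_o\right\}. \qquad (74)$$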

Appendix I: Proof of Corollary [2] 

It has been shown that the myopic policy is optimal when $K = 1$ and $p_{11} > p_{01}$ [10], [11]
(note that for $N = 2, 3$ negatively correlated channels, the optimality of the myopic policy has
also been established). Based on the equivalency between Whittle's index policy and the myopic
policy, we conclude that Whittle's index policy is optimal for $K = 1$ and $p_{11} > p_{01}$.

We now prove that Whittle's index policy is optimal when $K = N - 1$. We construct a genie-aided
system where the user knows the states $S_i(t)$ of all channels at the end of each slot $t$. In
this system, Whittle's index policy is clearly optimal, and the optimal performance is an upper
bound on that of the original system. For the original system, where the user only knows the states of the
$N - 1$ sensed channels, we notice that the channel ordering by Whittle's index policy in each
slot is the same as that in the genie-aided system. Whittle's index policy thus achieves the same
performance as in the genie-aided system. It is thus optimal.

According to Theorem 5, we arrive at the following inequalities (notice that $J_w \ge K\omega_o$).

max{l -pii }, if Pn > Poi 
V>{ ■ (75) 



From (75), we have $\eta \ge \frac{K}{N}$.

Next, we show that Whittle's index policy achieves at least half the optimal performance for
negatively correlated channels ($p_{11} < p_{01}$). In this case, we have

$$\eta \ge \frac{1 - T^1(p_{11}) + p_{01}}{1 - p_{11} + p_{01}} = 1 + \frac{(p_{11} - p_{01})(1 - p_{11})}{1 - (p_{11} - p_{01})} \ge 1 - \frac{(1 - p_{11})^2}{2 - p_{11}} \ge \frac{1}{2}.$$
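The one-half guarantee can be spot-checked by evaluating the ratio $(1 - T^1(p_{11}) + p_{01})/(1 - p_{11} + p_{01})$ over a grid of negatively correlated parameters, with $T^1(\omega) = p_{01} + \omega(p_{11} - p_{01})$ as in (69). The following is a minimal sketch; the function names and the grid resolution are hypothetical.

```python
def belief_update(omega, p01, p11):
    """One-step belief update T^1(w) = p01 + w (p11 - p01)."""
    return p01 + omega * (p11 - p01)


def half_bound_ratio(p01, p11):
    """Ratio (1 - T^1(p11) + p01) / (1 - p11 + p01) for p11 < p01."""
    return (1 - belief_update(p11, p01, p11) + p01) / (1 - p11 + p01)


def min_ratio_on_grid(steps=100):
    """Smallest ratio over grid points with p11 < p01."""
    best = float("inf")
    for i in range(1, steps + 1):
        p01 = i / steps
        for j in range(i):                      # p11 strictly below p01
            best = min(best, half_bound_ratio(p01, j / steps))
    return best


if __name__ == "__main__":
    # The minimum is expected at p01 = 1, p11 = 0, where the ratio equals 1/2.
    print(min_ratio_on_grid())
```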